Abstract
Executive Summary


This document describes the advanced encryption-based solutions developed in MOSAICrOWN 
for protecting confidentiality of data ingested, stored, and processed in a digital data market. It 
addresses different aspects of the problem, to ensure proper data protection while guaranteeing 
access and processing functionality in the different phases of the data life cycle in the digital 
data market (i.e., ingestion, storage, and processing). Leveraging such protection also enables an
enriched functionality of the data market that can see the involvement, for data storage, access, 
and computation, of parties not fully trusted or not fully authorized for complete data access. The 
document is organized in three chapters. 

Chapter 1 presents our techniques to enable the protection of data in the context of data analytics
and processing, and in particular generic query execution. Our approach supports three
levels of visibility for parties over the data: plaintext visibility, encrypted visibility, and no visibil- 
ity. It captures and controls the data flows enacted by the execution of a query plan and computes 
an optimal assignment of the different steps of the computation to different parties, based on their 
authorizations as well as their cost. The aim is the identification of an assignment for the execu- 
tion of the operations in the query plan that is both economically convenient and compliant with 
authorizations. Our solution enriches the query plan execution with the application of encryption 
as needed to ensure that parties involved in the computation will be able to (directly or indirectly) 
view only the data they are authorized to access. Encryption will then be injected on-the-fly in the 
query plan execution, dynamically enforcing data wrapping and unwrapping. 

Chapter 2 addresses the protection of data ingested and stored in the data market. It presents
the adoption of an advanced solution providing All-Or-Nothing Transform (AONT) encryption, 
which ensures that data remain protected in their entirety even when the data market (or part of it) 
is not fully trusted or when the encryption key is leaked or compromised. AONT encryption en- 
ables resource deletion and the dynamic enforcement of access revocation without requiring com- 
plete re-encryption of possibly large resources. Our approach enriches the application of AONT 
with a solution for slicing resources and distributing the (possibly replicated) slices for storage
at different parties, considering scenarios of data markets leveraging decentralized
cloud storage services. Our modeling allows the identification of confidentiality and availability
guarantees provided by the combination of slicing and replication parameters, hence enabling the
data owner to set them so as to best suit the needs of each specific scenario.

Chapter 3 presents the design of a solution leveraging encryption for enabling data owners to
wrap data with a self-protecting layer when ingesting them in the data market. The approach is 
based on the use of hierarchical encryption and key derivation to provide efficient key manage- 
ment. It enables owners to ingest an encrypted/wrapped version of their data in the data market 
and enables consumers to access them by properly demonstrating their ability to derive the data 
decryption key. Our solution also puts forward the idea of combining encryption and key man- 
agement with blockchain and smart contract technologies, towards the realization of a data market 
empowering owners to realize economic incentives when making their data available to others. 
1. Dynamic wrapping for collaborative computations


In this chapter, we discuss the problem of enabling efficient collaborative evaluation of compu- 
tations over data managed by the cloud market, owned by different parties. The data market 
represents a promising solution for combining data from different sources, with the aim of extracting
useful information. Since computations over large data collections can be expensive, we take
into consideration the possibility for the data market to participate and possibly partially delegate 
query evaluation to third parties, whenever it provides an economic advantage while not violating 
the need for protection of confidential information by data owners. Clearly, the release of information
by data owners for collaborative computations needs to be regulated according to the owners'
requirements, making data release selective.

In this chapter, we illustrate a solution aimed at enabling the involvement of external third par- 
ties in query evaluation, while guaranteeing the satisfaction of the authorization policy regulating 
data release defined by data owners. The approach illustrated in this chapter builds on the proposal
in [DFJ+17], enabling the support of a larger number of operators and properly managing
(attribute) rename operations.

The remainder of this chapter is organized as follows. Section 1.1 presents related works
and the innovation of MOSAICrOWN. Section 1.2 introduces the preliminary results on which
our solution builds. Section 1.3 illustrates the problems related to authorization enforcement
when attributes are renamed. Section 1.4 defines the relation profile resulting from the evaluation
of rename and set operations, and of arithmetic expressions. Section 1.5 discusses authorization
enforcement, and the computation of the assignment of operations in a query plan that minimizes
query evaluation costs, while satisfying authorizations. Section 1.6 concludes the chapter.


1.1 State of the art and MOSAICrOWN innovation 


In this section, we illustrate the state of the art and the innovation produced by MOSAICrOWN for 
collaborative computations over data stored in the market, possibly owned by different authorities. 


1.1.1 State of the art 


The problem of managing queries in distributed scenarios has been extensively studied, but
traditional solutions (e.g., [LSK95]) as well as modern approaches that consider big data
analytics (e.g., [AAC+18, AXL+15, RLG17]) do not take into consideration access restrictions.
In the relational database context, access restrictions can be supported by views (e.g., [DFJ+14,
GB14, RMSR04]), access patterns (e.g., [AB18, BLT15]), or data masking (e.g., [KB16]). Such
proposals however do not consider encryption.

Different works have addressed the problem of protecting data confidentiality in distributed
computations (e.g., [DFJ+11, OKM17, SKS+19, ZZL+15]). Among these proposals, one presents an
approach to collaboratively execute queries on data subject to access restrictions, considering
different join evaluation strategies. Another proposes an operator placement approach
aimed at satisfying privacy constraints while maximizing performance in query evaluation; it
relies on programming language techniques for regulating and controlling information flows.
A third provides a solution for restricting access and sharing of
distributed data, which supports the explicit consideration of join paths in the authorizations.
A fourth aims to protect MapReduce computations in hybrid clouds, preventing flows
of sensitive information to the public cloud. These works confirm the relevance of the problem
but focus on different aspects. None of the proposals considers the possibility of protecting data
with encryption. The idea of specifying different visibility levels over data was first proposed
in [DFJ+17], and a subsequent approach integrated this authorization model in a distributed query
optimizer.

Several works (e.g., [AAKL06, HIML02, PRZB11, TKMZ13]) have investigated the use and
support of encryption for the protection of data in storage or query execution. Other approaches
(e.g., [BEE+17, CLS09]) proposed solutions for using secure multiparty computation in query
evaluation, to keep both the input operands and the result secret from the party in charge of query
evaluation.


1.1.2 MOSAICrOWN innovation


MOSAICrOWN produced several advancements over the state of the art, which are discussed in 
this section. 


• The first innovation is represented by the analysis and support of additional operators (i.e.,
rename operation, set operations, and arithmetic expressions) with respect to the ones
considered in the original proposal.


• The second innovation consists of the definition of an extended relation profile, which makes
it possible to keep track of derived attributes (i.e., attributes resulting from a rename operation or
from the evaluation of an expression). The verification of releases is therefore revised and
extended to take into account such an extended relation profile to manage attribute names
that have not been used in the definition of authorizations.


Some of the results obtained by MOSAICrOWN and illustrated in this chapter have been
published in [BDF+19a]. A preliminary version of the proposal illustrated in this chapter has been
implemented by one of the tools presented in deliverable D4.1 “First version of encryption-based
protection tools” [FL20].


1.2 Preliminaries 


For concreteness, we frame our work in the context of the execution of queries over relational 
tables, performed according to a query plan represented as a tree whose leaves are base relations 
and whose non-leaf nodes are operations to be executed to perform the query. 

The two building blocks of the approach illustrated in this chapter are the authorization model 
regulating data visibility, and the definition of relation profiles capturing the informative content 
(directly or indirectly) conveyed by relations. 
Figure 1.1: An example of a query plan enriched with relation profiles, assignees for the execution
of its operations, and encryption/decryption operations (a), and of a set of authorizations (b)


Authorization model. According to the authorization model proposed in [DFJ+17], authorizations
specify whether a subject can have, for an attribute:


• plaintext visibility: the subject has complete visibility on the values of the attribute;


• encrypted visibility: the subject cannot view the plaintext values of the attribute, but can
view an encrypted version of them;


• no visibility: the subject can view the values of the attribute neither in plaintext nor encrypted.

Authorizations are then defined as follows.


Definition 1.2.1 (Authorization) Let $R$ be a relation and $\mathcal{S}$ be a set of subjects. An authorization
is a rule of the form $[P,E] \rightarrow S$, where $P \subseteq R$ and $E \subseteq R$ are subsets of attributes in $R$ such that
$P \cap E = \emptyset$, and $S \in \mathcal{S} \cup \{any\}$.


Assuming a closed policy, a subject can hold at most one authorization for each relation, stating 
which attributes (P) she can view in plaintext and which attributes (E) she can view encrypted. The 
subject cannot view any other attribute in the relation schema. Clearly, if S can access an attribute 
a in plaintext she can also access its encrypted version. The consideration of encrypted visibility 
enables subjects who are not trusted to access the sensitive data content to perform computations 
over them in encrypted form. 
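
The authorization rule and the closed policy described above can be sketched in Python as follows. This is a minimal illustration, not the project's implementation: the class `Authorization`, the function `visibility`, the fallback of unlisted subjects to the rule for 'any', and the sample rules (which are not those of Figure 1.1(b)) are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Authorization:
    """Rule [P, E] -> S over a single relation (hypothetical encoding)."""
    subject: str
    plaintext: frozenset   # P: attributes visible in plaintext
    encrypted: frozenset   # E: attributes visible only in encrypted form

    def __post_init__(self):
        # The definition requires P and E to be disjoint.
        if self.plaintext & self.encrypted:
            raise ValueError("P and E must be disjoint")


def visibility(rules, subject, attribute):
    """Return 'plaintext', 'encrypted', or 'none' under a closed policy.

    `rules` maps a subject name to her single rule for the relation.
    Assumption: a subject without an explicit rule falls back to 'any'.
    """
    rule = rules.get(subject, rules.get("any"))
    if rule is None:
        return "none"
    if attribute in rule.plaintext:
        return "plaintext"
    if attribute in rule.encrypted:
        return "encrypted"
    return "none"  # closed policy: anything not granted is denied


# Hypothetical rules over Hospital(S, B, D, T).
rules = {
    "H": Authorization("H", frozenset("SBDT"), frozenset()),
    "X": Authorization("X", frozenset("D"), frozenset("ST")),
    "any": Authorization("any", frozenset(), frozenset("D")),
}
print(visibility(rules, "H", "S"))  # plaintext
print(visibility(rules, "X", "T"))  # encrypted
print(visibility(rules, "X", "B"))  # none
print(visibility(rules, "Q", "D"))  # encrypted
```

Keeping one rule per subject and relation mirrors the statement that a subject holds at most one authorization for each relation.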

Figure 1.1(b) illustrates a set of authorizations over relations Hospital(SSN, BirthDate,
Disease, Treatment) and Insurance(Customer, Premium) for a set of six subjects: H (owner of relation
Hospital), I (owner of relation Insurance), U (the user posing the query), and X, Y, and Z (subjects
offering computational capabilities), along with the default authorization for ‘any’. In the figure,
attributes are denoted by their initials and, for the sake of readability, in the authorizations we
denote a set of attributes simply with the sequence of the attributes composing it, omitting the curly
brackets and commas (e.g., SBDT stands for {S,B,D,T}).

Relation profiles. A relation resulting from a computation can convey information on attributes
not explicitly appearing in its schema. The profile of a relation includes: visible attributes that 
appear in the relation schema (in plaintext or encrypted), implicit attributes taken into account in 
the computation of the relation (in plaintext or encrypted), and equivalence relationships among 
attributes compared in the computation. Intuitively, implicit attributes are all those attributes that 
appear in a selection condition or grouping operation in the (sub-)query producing the relation. For 
instance, attribute B is implicit in the profile of the relation resulting from the evaluation of query 
“SELECT A FROM R WHERE B=‘10’”. Attributes are considered equivalent if they have been 
compared in a computation and therefore visibility of one attribute indirectly leaks the other(s). 
For instance, attributes A and B are equivalent in the profile of the relation resulting from the
evaluation of query “SELECT A FROM R JOIN R' ON A=B”.

The profile of the relation resulting from the evaluation of an operation depends on the operator 
and on the profile of its operands. Figure 1.2 illustrates the profile resulting from the evaluation of:
projection, selection, cartesian product, join, and group by operators. For each operator, the figure
reports the general formula on the left and a simple example on the right. In the figure, the profile
of a relation is represented as a tag attached to the node generating it, where v denotes the visible
component, i the implicit component, and $\simeq$ the equivalence relationships. Encrypted attributes
are represented on a gray background. The figure also reports encryption (gray box on top of the 
operand relation) and decryption (white box below the subsequent operator) operations. 

Since profiles capture the informative content of relations, they can be used to verify the
satisfaction of authorizations when releasing a relation. A subject is authorized for a relation if
she can legitimately access all the informative content carried by the relation, and she is
authorized for the execution of an operation in a query plan if she is authorized for its input and
output relations. Given a query plan, the preliminary solution injects encryption and decryption
operations to enable different subjects to execute different operations, in full respect of the
authorizations.


1.3 Derived attributes 


The preliminary model illustrated in Section 1.2 supports the execution of computations involving
projections, selections, cartesian products, joins, and group by operations. Computations involving
other operations, such as set operations (union $\cup$, intersection $\cap$, and difference $\setminus$) and arithmetic
operations, could not be handled. The consideration of such additional operations introduces new 
challenges that need to be carefully addressed, mainly due to the possibility of introducing, during 
the computation, derived attributes. Derived attributes are attributes not appearing in any original 
relation schema and obtained through either a renaming operation or an arithmetic/set operation, 
with new names dictated by the computation itself. The release of a relation with derived attributes 
clearly discloses them, even if they do not appear in the original relation schemas. While such 
derived attributes might resemble the concept of implicit attributes investigated in the preliminary 
model, unfortunately they cannot be directly managed as such. In particular, the information 
leakage due to derived attributes complicates the enforcement of the authorization policy, since 
authorizations are defined over the original attributes appearing in the relation schemas and do not 
regulate the release of attributes with different (new) names. For instance, query “SELECT A AS B 
FROM R” reveals the values of attribute A under name B, but no authorization regulates the release 
of B since B does not belong to the original schema of R. Clearly, the authorizations originally 
defined over A must apply also to B, since A and B are two different names for the same attribute. 

In MOSAICrOWN, we have investigated this issue and proposed a solution for correctly managing
derived attributes. This leads to a more complete model that allows for more complex operations
involving, besides basic relational operators, also advanced operators such as set operators
and arithmetic expressions. In the remainder of this chapter, we illustrate how derived attributes
can be captured in relation profiles, how they can be impacted by the operations involved in a
query, and how the authorizations can then be enforced when assigning query operations to
subjects for their execution.

MOSAICrOWN Deliverable D4.2

[Figure 1.2 panels, each with its general formula: projection, selection, cartesian product, join, group by, encryption, decryption]

Figure 1.2: Graphical representation of the profiles resulting from relational and encryption/decryption operations


1.4 Relation profile 


We define the profile of a relation to capture the informative content carried by the relation in terms 
of attributes explicitly as well as implicitly visible and taking into account information conveyed 
by equivalent and derived attributes. We refer to attributes explicitly visible in a relation as visible 
attributes, and to those implicitly leaked as implicit. In addition, attributes can be represented in 
plaintext or encrypted. In the following, we will refer to non-derived attributes (i.e., attributes
appearing in the original relation schemas) as base attributes. 


Definition 1.4.1 (Relation Profile) The profile of a relation $R$ is a 6-tuple of the form
$[R^{vp}, R^{ve}, R^{ip}, R^{ie}, R^{\simeq}, R^{\rightsquigarrow}]$ where: $R^{vp}$ and $R^{ve}$ are the visible attributes appearing in $R$'s schema
in plaintext ($R^{vp}$) or encrypted ($R^{ve}$) form; $R^{ip}$ and $R^{ie}$ are the implicit attributes conveyed by $R$, in
plaintext ($R^{ip}$) or encrypted ($R^{ie}$) form; $R^{\simeq}$ is a disjoint-set data structure representing the closure
of the equivalence relationship implied by attributes connected in $R$'s computation; and $R^{\rightsquigarrow}$ is
a set of pairs $[a,A]$ expressing the fact that attribute $a$ has been derived from the set $A$ of base
attributes.


Intuitively, the profile of a relation is extended to keep track of the correspondence between 
the attributes derived from a computation (i.e., an aggregate function or an arithmetic expression) 
and the attributes from which they have been obtained. Such a correspondence is needed to verify 
whether these new attributes can be released to a given subject as this depends on the base attributes 
(which are the only attributes explicitly mentioned in the authorizations) from which they have 
been computed. 

Even if a new attribute $a$ can be derived from both base attributes and derived attributes,
we can always regard the $R^{\rightsquigarrow}$ component as composed of a set of pairs $[a,A]$ where $A$ includes base
attributes only. Indeed, the $R^{\rightsquigarrow}$ component of a relation profile can be easily used to translate
a pair $[a,A]$ into a new pair $[a,A']$ where $A'$ includes only base attributes. For instance, assume
that $R^{\rightsquigarrow}=\{[a_3,\{a_1,a_2\}]\}$ and that attribute $a_5$ derives from attributes $a_3$ and $a_4$, meaning that the
pair $[a_5,\{a_3,a_4\}]$ must be added to $R^{\rightsquigarrow}$. In this case, since attribute $a_3$ derives from $\{a_1,a_2\}$, then,
by transitivity, $a_5$ derives from $\{a_1,a_2,a_4\}$, and pair $[a_5,\{a_1,a_2,a_4\}]$ can then be added to $R^{\rightsquigarrow}$.
Formally, given an attribute $a$ and component $R^{\rightsquigarrow}$, we define a function $\omega(a,R^{\rightsquigarrow})$ that returns $A$ if
$R^{\rightsquigarrow}$ includes a pair $[a,A]$; it returns $a$, otherwise. In the following, with a slight abuse of notation,
$\omega(A,R^{\rightsquigarrow})$ will denote the application of function $\omega$ to each attribute in $A$.
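
The transitive resolution of derived attributes described above can be sketched in Python. This is an illustrative sketch under assumptions: the derived component is encoded as a dictionary from each derived attribute to the set of base attributes it was obtained from, the helper names (`lookup`, `lookup_all`, `add_derived`) are hypothetical, and a base attribute is returned as a singleton set rather than as the bare attribute.

```python
def lookup(attr, derived):
    """Return the base attributes that `attr` stands for.

    An attribute with no entry in `derived` is a base attribute and
    maps to itself (returned here as a singleton set).
    """
    return set(derived.get(attr, {attr}))


def lookup_all(attrs, derived):
    """Apply `lookup` to every attribute in `attrs` and merge the results."""
    result = set()
    for a in attrs:
        result |= lookup(a, derived)
    return result


def add_derived(derived, new_attr, sources):
    """Record that `new_attr` derives from `sources`, resolved to base attributes."""
    updated = dict(derived)
    updated[new_attr] = frozenset(lookup_all(sources, derived))
    return updated


# The example from the text: a3 derives from {a1, a2}; a5 from {a3, a4}.
derived = {"a3": frozenset({"a1", "a2"})}
derived = add_derived(derived, "a5", {"a3", "a4"})
print(sorted(derived["a5"]))  # ['a1', 'a2', 'a4']
```

Because every entry is resolved when it is inserted, each stored set contains base attributes only and a single lookup never needs to recurse.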

The profile of a base relation has all the elements but $R^{vp}$ empty since it is assumed accessible
in plaintext and does not carry any implicit content or equivalence/renaming relationship. (Note
that plaintext accessibility of a relation does not imply that it is stored in plaintext but only that it
is accessible in plaintext by its owner.) Formally, the profile of a base relation $R(a_1,\ldots,a_n)$ is then
$[\{a_1,\ldots,a_n\}, \emptyset, \emptyset, \emptyset, \emptyset, \emptyset]$.

The profile of the relation resulting from a query depends on the profile of the operand relations 
and on the operators involved in its computation. Every operator only operates on visible attributes 
(i.e., attributes in $R^{vp}$ and $R^{ve}$, which belong to the schema of the operand relation $R$), but it may
also affect implicit, equivalent, and derived attributes in the profile of the resulting relation. The
preliminary model (Section 1.2) defined the profiles resulting from the application of projection, selection,
cartesian product, join, and group by (this latter with the simplifying assumption of maintaining
the same names for the attributes over which the aggregate function operates), assuming a simplified
profile that does not include component $R^{\rightsquigarrow}$. Indeed, selection, projection, cartesian product, and
join do not affect the $R^{\rightsquigarrow}$ component of profiles, which remains unchanged from the operands to
the operation result.

The presence of component $R^{\rightsquigarrow}$ in relation profiles makes it easy to support the use of rename
and set operators, as well as of arithmetic expressions, as illustrated in the following.


Renaming ($\rho$). Renaming changes the name of a subset of the (plaintext or encrypted) visible
attributes of a relation. Since the resulting relation has the same schema as the operand apart from
the renamed attributes, the visible component of the profile of the result reflects such a change.
Hence, the visible component of the profile of the result is the same as the one of the operand,
apart from the fact that the attribute $a$ on which the rename operator is applied is substituted with
its new name $a'$ in the plaintext or encrypted component. The implicit attributes and equivalence
sets are the same as the ones of the operand. Note that if attribute $a$ appears among the implicit
attributes or in an equivalence set in the operand relation, it is not replaced with its new name $a'$.
For each renamed attribute, pair $[a', \omega(a,R^{\rightsquigarrow})]$ is added to the set of renamed attributes. Note that
component $R^{\rightsquigarrow}$ keeps track of the correspondence between the new name $a'$ of the attribute and
the name of the corresponding attribute in base relations, obtained evaluating $\omega(a,R^{\rightsquigarrow})$. Indeed,
if $a'$ is the new name for $a$ and $a$ is not a name in a base relation, $a$ is substituted with $\omega(a,R^{\rightsquigarrow})$
before populating the $R^{\rightsquigarrow}$ component in the relation profile. Consider a rename operation $R = \rho_{a' \leftarrow a}(R_l)$
operating on relation $R_l$ with profile $[R_l^{vp}, R_l^{ve}, R_l^{ip}, R_l^{ie}, R_l^{\simeq}, R_l^{\rightsquigarrow}]$. The profile of $R$ is defined as
follows (see Figure 1.3):


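• $R^{vp} = (R_l^{vp} \setminus \{a\}) \cup \{a'\}$ if $a \in R_l^{vp}$, $R^{vp} = R_l^{vp}$ otherwise;

• $R^{ve} = (R_l^{ve} \setminus \{a\}) \cup \{a'\}$ if $a \in R_l^{ve}$, $R^{ve} = R_l^{ve}$ otherwise;

• $R^{ip} = R_l^{ip}$;

• $R^{ie} = R_l^{ie}$;

• $R^{\simeq} = R_l^{\simeq}$;

• $R^{\rightsquigarrow} = R_l^{\rightsquigarrow} \cup \{[a', \omega(a, R_l^{\rightsquigarrow})]\}$.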


Figure 1.3 illustrates an example of the profiles resulting from two rename operations, renaming
plaintext attribute B (left-hand side of the figure) and encrypted attribute T (right-hand side of
the figure) to the new attribute name K.
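
The effect of a rename on a profile can be sketched in Python as follows. This is a minimal sketch, not the deliverable's tool: the `Profile` class and its field names are hypothetical, the derived component is encoded as a dictionary from derived attribute to base attributes, and the operand profile in the example is an assumption (it is not read from Figure 1.3).

```python
from dataclasses import dataclass, field, replace


@dataclass
class Profile:
    """Hypothetical encoding of the six profile components."""
    vp: frozenset = frozenset()   # visible, plaintext
    ve: frozenset = frozenset()   # visible, encrypted
    ip: frozenset = frozenset()   # implicit, plaintext
    ie: frozenset = frozenset()   # implicit, encrypted
    eq: tuple = ()                # equivalence sets
    derived: dict = field(default_factory=dict)  # derived -> base attributes


def rename(profile, old, new):
    """Profile of the relation obtained by renaming visible attribute `old` to `new`."""
    # Resolve `old` to base attributes (it may itself be a derived name).
    base = profile.derived.get(old, frozenset({old}))

    def swap(attrs):
        return (attrs - {old}) | {new} if old in attrs else attrs

    # Only the visible components change; implicit attributes and
    # equivalence sets keep the old name, as stated in the text.
    return replace(
        profile,
        vp=swap(profile.vp),
        ve=swap(profile.ve),
        derived={**profile.derived, new: base},
    )


operand = Profile(vp=frozenset("SB"), ip=frozenset("B"))
result = rename(operand, "B", "K")
print(sorted(result.vp), sorted(result.ip), sorted(result.derived["K"]))
# ['K', 'S'] ['B'] ['B']
```

Renaming K again (say, to M) would record M as derived from B rather than from K, since the lookup resolves K to its base attribute first.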


Figure 1.3: Graphical representation of the profile resulting from rename operation


Set operators ($\cup$, $\cap$, $\setminus$). Set operators are binary operators that work like the corresponding
operators from mathematical set theory. They combine the relations resulting from two (or more)
queries into a single result set, returning the tuples that belong to: at least one between $R_l$ and
$R_r$ (union $\cup$), both $R_l$ and $R_r$ (intersection $\cap$), $R_l$ but not $R_r$ (difference $\setminus$). According to SQL
standards, we assume that the schema of the resulting relation is the one of the first operand, that
is, $R_l$. Note that $R_l$ and $R_r$ must have the same number of attributes in their schema to enable
the evaluation of a set operator. Since set operators produce as a result a relation with the same
schema as the first operand $R_l$, the result has the same visible attributes as $R_l$. However, since the
result conveys information about both the operands, the implicit attributes, the sets of equivalent
attributes, and renamed attributes are the union of the corresponding components (i.e., sets of
attributes) in the profiles of the operands. Also, for each pair of attributes $a_{l_i}$ in $R_l$ and $a_{r_i}$ in $R_r$
appearing in the same ($i$-th) position in the schema of the operand relations, equivalence $\{a_{l_i},a_{r_i}\}$
is added to the equivalence set. Indeed, the $i$-th attribute in the schema of the result is obtained by
comparing and/or combining the values of the $i$-th attribute $a_{l_i}$ in the schema of $R_l$ and the values
of the $i$-th attribute $a_{r_i}$ in the schema of $R_r$. More precisely, the union operator appends the values
in $a_{r_i}$ to the ones in $a_{l_i}$, while intersection and difference operators compare the values in $a_{l_i}$ with
the ones in $a_{r_i}$. It is interesting to note that even if intersection and difference operators reduce
the number of tuples in the resulting relation with respect to the number of tuples in its operands,
the implicit information of the result (and hence its profile) is richer than that of each operand taken
singularly. For instance, consider a relation $R$ resulting from the difference $R = R_l \setminus R_r$. Clearly, $R$
strongly depends on $R_r$ even if no tuple in $R_r$ appears in $R$ (and hence its implicit components
need to consider those of both $R_l$ and $R_r$).

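Consider operation $R = R_l\ set\_op\ R_r$, with $set\_op \in \{\cup,\cap,\setminus\}$ and operating on relations $R_l$ and
$R_r$ with profile $[R_l^{vp}, R_l^{ve}, R_l^{ip}, R_l^{ie}, R_l^{\simeq}, R_l^{\rightsquigarrow}]$ and $[R_r^{vp}, R_r^{ve}, R_r^{ip}, R_r^{ie}, R_r^{\simeq}, R_r^{\rightsquigarrow}]$, respectively. The profile
of $R$ is defined as follows (see Figure 1.4):


• $R^{vp} = R_l^{vp}$;

• $R^{ve} = R_l^{ve}$;

• $R^{ip} = R_l^{ip} \cup R_r^{ip}$;

• $R^{ie} = R_l^{ie} \cup R_r^{ie}$;

• $R^{\simeq} = R_l^{\simeq} \cup R_r^{\simeq} \cup \{\{a_{l_i}, a_{r_i}\} : a_{l_i} \in R_l^{vp} \cup R_l^{ve}, a_{r_i} \in R_r^{vp} \cup R_r^{ve}, i = 1,\ldots,|R_l^{vp} \cup R_l^{ve}|\}$;

• $R^{\rightsquigarrow} = R_l^{\rightsquigarrow} \cup R_r^{\rightsquigarrow}$.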
Figure 1.4: Graphical representation of the profile resulting from set operators


Note that the representation of attributes $a_{l_i}$ and $a_{r_i}$ must be consistent, that is, the attributes must be
both plaintext or both encrypted for the evaluation of set operators. Figure 1.4 illustrates the profile
resulting from the execution of the union operator over two relations with schemas composed
of two attributes. The resulting profile, while keeping in its schema the attributes of the first
operand, includes in its equivalence component the fact that SK and TP have been connected in the
computation (where K is a renamed version of attribute C).
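
The profile rules for set operators can be sketched in Python as follows. This is an illustration under assumptions: profiles are plain dictionaries with hypothetical keys, the ordered schema is carried explicitly in a `schema` list so that attributes can be paired by position, and the operand profiles in the example are guessed from the description of Figure 1.4 (left schema S, T; right schema K, P with K renamed from C).

```python
def merge_equivalences(groups):
    """Merge overlapping attribute sets until all are pairwise disjoint."""
    merged = []
    for group in groups:
        group = set(group)
        rest = []
        for other in merged:
            if other & group:
                group |= other      # absorb every set sharing an attribute
            else:
                rest.append(other)
        merged = rest + [group]
    return merged


def set_operation(left, right):
    """Profile of `left set_op right` for union, intersection, or difference."""
    if len(left["schema"]) != len(right["schema"]):
        raise ValueError("operands must have the same number of attributes")
    pairs = []
    for l, r in zip(left["schema"], right["schema"]):
        # Paired attributes must be both plaintext or both encrypted.
        if (l in left["vp"]) != (r in right["vp"]):
            raise ValueError(f"inconsistent representation for {l} and {r}")
        pairs.append({l, r})
    return {
        "schema": list(left["schema"]),          # schema of the first operand
        "vp": set(left["vp"]),
        "ve": set(left["ve"]),
        "ip": left["ip"] | right["ip"],
        "ie": left["ie"] | right["ie"],
        "eq": merge_equivalences(left["eq"] + right["eq"] + pairs),
        "derived": {**left["derived"], **right["derived"]},
    }


left = {"schema": ["S", "T"], "vp": {"S", "T"}, "ve": set(),
        "ip": set(), "ie": set(), "eq": [], "derived": {}}
right = {"schema": ["K", "P"], "vp": {"K", "P"}, "ve": set(),
         "ip": set(), "ie": set(), "eq": [], "derived": {"K": frozenset("C")}}
result = set_operation(left, right)
print(sorted(result["vp"]))                      # ['S', 'T']
print(sorted(sorted(g) for g in result["eq"]))   # [['K', 'S'], ['P', 'T']]
```

The same function serves all three operators because the rules above do not depend on which set operator is evaluated.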


Arithmetic expressions and aggregate functions. The result of a query can include, besides a
list of attributes, also the result of an arbitrary arithmetic expression or of an aggregate function
over a set of attributes. The query associates a name $a$ with the result of the arithmetic expression
$Exp$ defined over a set $A$ of attributes, or of the aggregate function operating over a set of
attributes. We denote such operations with notations $\rho_{a \leftarrow Exp}(R)$ and $\gamma_{X,a \leftarrow f(Y)}(R)$, respectively
(for aggregates, $f$ is the function, $Y$ the set of attributes on which it operates, and $X$ the grouping attributes).
Note that the attributes in $A$ must be all encrypted or all plaintext for the evaluation of the arithmetic
expression or of the aggregate function. Clearly, if an operation requires to operate on plaintext
values, attributes in $A$ must be plaintext.

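Consider the evaluation of arithmetic expression $Exp$ defined over a set $A$ of attributes to which
the query associates name $a$, denoted $\rho_{a \leftarrow Exp}(R_l)$. The evaluation of $R = \rho_{a \leftarrow Exp}(R_l)$, operating
on relation $R_l$ with profile $[R_l^{vp}, R_l^{ve}, R_l^{ip}, R_l^{ie}, R_l^{\simeq}, R_l^{\rightsquigarrow}]$, produces a relation $R$ with profile:


• $R^{vp} = (R_l^{vp} \setminus A) \cup \{a\}$ if $A \subseteq R_l^{vp}$, $R^{vp} = R_l^{vp}$ otherwise;

• $R^{ve} = (R_l^{ve} \setminus A) \cup \{a\}$ if $A \subseteq R_l^{ve}$, $R^{ve} = R_l^{ve}$ otherwise;

• $R^{ip} = R_l^{ip}$;

• $R^{ie} = R_l^{ie}$;

• $R^{\simeq} = R_l^{\simeq}$;

• $R^{\rightsquigarrow} = R_l^{\rightsquigarrow} \cup \{[a, \omega(A, R_l^{\rightsquigarrow})]\}$.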


Figure 1.5 illustrates an example of the profiles resulting from two arithmetic expressions,
assigning the new attribute name K to the sum computed over two plaintext (left-hand side of
the figure) and two encrypted (right-hand side of the figure) attributes T and P.
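
The profile produced by an arithmetic expression can be sketched in Python as follows. This is an illustrative sketch: profiles are plain dictionaries with hypothetical keys, the helper names are assumptions, and the operand profile in the example (visible plaintext attributes S, T, P) is invented for illustration rather than taken from Figure 1.5.

```python
def base_attributes(attrs, derived):
    """Resolve every attribute in `attrs` to the base attributes it depends on."""
    out = set()
    for a in attrs:
        out |= set(derived.get(a, {a}))
    return out


def arithmetic(profile, new_attr, operands):
    """Profile after naming `new_attr` the result of an expression over `operands`."""
    operands = set(operands)
    in_plain = operands <= profile["vp"]
    in_enc = operands <= profile["ve"]
    # The operands must be all plaintext or all encrypted visible attributes.
    if not operands or not (in_plain or in_enc):
        raise ValueError("operands must be all plaintext or all encrypted")
    vp, ve = set(profile["vp"]), set(profile["ve"])
    if in_plain:
        vp = (vp - operands) | {new_attr}
    else:
        ve = (ve - operands) | {new_attr}
    derived = dict(profile["derived"])
    derived[new_attr] = frozenset(base_attributes(operands, profile["derived"]))
    return {
        "vp": vp, "ve": ve,
        "ip": set(profile["ip"]), "ie": set(profile["ie"]),   # unchanged
        "eq": [set(g) for g in profile["eq"]],                # unchanged
        "derived": derived,
    }


operand = {"vp": {"S", "T", "P"}, "ve": set(), "ip": set(), "ie": set(),
           "eq": [], "derived": {}}
result = arithmetic(operand, "K", {"T", "P"})   # K = T + P over plaintext values
print(sorted(result["vp"]), sorted(result["derived"]["K"]))
# ['K', 'S'] ['P', 'T']
```

Rejecting mixed operands is a design choice of this sketch that reflects the requirement that the attributes of the expression be all encrypted or all plaintext.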
Figure 1.5: Graphical representation of the profile resulting from arithmetic expressions


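Consider now the evaluation of an aggregate function $f$ defined over a set $Y$ of attributes and
computed over groups of tuples with the same values for attributes in $X$, denoted $\gamma_{X,a \leftarrow f(Y)}(R_l)$.
The evaluation of $R = \gamma_{X,a \leftarrow f(Y)}(R_l)$, operating on relation $R_l$ with profile $[R_l^{vp}, R_l^{ve}, R_l^{ip}, R_l^{ie}, R_l^{\simeq}, R_l^{\rightsquigarrow}]$,
produces a relation $R$ with profile:


• $R^{vp} = (R_l^{vp} \cap X) \cup \{a\}$ if $Y \subseteq R_l^{vp}$, $R^{vp} = (R_l^{vp} \cap X)$ otherwise;

• $R^{ve} = (R_l^{ve} \cap X) \cup \{a\}$ if $Y \subseteq R_l^{ve}$, $R^{ve} = (R_l^{ve} \cap X)$ otherwise;

• $R^{ip} = R_l^{ip} \cup (R_l^{vp} \cap X)$;

• $R^{ie} = R_l^{ie} \cup (R_l^{ve} \cap X)$;

• $R^{\simeq} = R_l^{\simeq}$;

• $R^{\rightsquigarrow} = R_l^{\rightsquigarrow} \cup \{[a, \omega(Y, R_l^{\rightsquigarrow})]\}$.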


Figure 1.6 illustrates an example of the profiles resulting from two aggregate functions,
assigning the new attribute name K to the average computed over the plaintext (left-hand side of
the figure) and encrypted (right-hand side of the figure) attribute P.
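
The profile produced by an aggregate function with grouping can be sketched in Python as follows. This is a sketch under assumptions: profiles are plain dictionaries with hypothetical keys, and the operand profile and grouping attribute D in the example are invented for illustration (Figure 1.6 is not reproduced here).

```python
def base_attributes(attrs, derived):
    """Resolve every attribute in `attrs` to the base attributes it depends on."""
    out = set()
    for a in attrs:
        out |= set(derived.get(a, {a}))
    return out


def aggregate(profile, group_by, new_attr, operands):
    """Profile after grouping on `group_by` and naming f(`operands`) as `new_attr`."""
    X, Y = set(group_by), set(operands)
    # Only the grouping attributes stay visible, plus the aggregate result.
    vp = profile["vp"] & X
    ve = profile["ve"] & X
    if Y and Y <= profile["vp"]:
        vp = vp | {new_attr}
    elif Y and Y <= profile["ve"]:
        ve = ve | {new_attr}
    derived = dict(profile["derived"])
    derived[new_attr] = frozenset(base_attributes(Y, profile["derived"]))
    return {
        "vp": vp, "ve": ve,
        # Grouping attributes also become implicit content of the result.
        "ip": profile["ip"] | (profile["vp"] & X),
        "ie": profile["ie"] | (profile["ve"] & X),
        "eq": [set(g) for g in profile["eq"]],
        "derived": derived,
    }


operand = {"vp": {"D", "P"}, "ve": set(), "ip": set(), "ie": set(),
           "eq": [], "derived": {}}
result = aggregate(operand, {"D"}, "K", {"P"})   # K = avg(P) grouped by D
print(sorted(result["vp"]), sorted(result["ip"]), sorted(result["derived"]["K"]))
# ['D', 'K'] ['D'] ['P']
```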


1.5 Authorization enforcement 


In this section, we illustrate how an authorization policy can be enforced (Section 1.5.1) and how
the execution of the operations entailed by a query can be assigned to subjects in compliance with
such a policy (Section 1.5.2).


1.5.1 Authorized visibility 


Figure 1.6: Graphical representation of the profile resulting from aggregate functions


Given a relation $R$ with profile $[R^{vp}, R^{ve}, R^{ip}, R^{ie}, R^{\simeq}, R^{\rightsquigarrow}]$, the release of $R$ to a subject $S$ depends on
the set of authorizations granted to $S$. However, authorizations are defined only on base attributes,
that is, those appearing in the base relations. Consistently with the assumption of operating under
a closed policy, the release of any derived attribute in $R$ would be denied. This situation is clearly
unacceptable, since the values of the derived attributes depend on the values of the base attributes
from which they have been (directly or indirectly) computed. Hence, we need to define a way for
regulating their release while operating with authorizations including only base attributes. Our
approach relies on the translation of relation profiles into simplified relation profiles, that is, relation
profiles defined over base attributes only. To this end, we leverage function $\omega(a,R^{\rightsquigarrow})$ (Section 1.4)
to replace derived attributes with the corresponding base attributes on which they have been
defined. Hence, a simplified relation profile is composed of five (rather than six) components, since
component $R^{\rightsquigarrow}$ is omitted and its content used to transform the other components in such a way
as to include base attributes only. Given a relation profile $[R^{vp}, R^{ve}, R^{ip}, R^{ie}, R^{\simeq}, R^{\rightsquigarrow}]$, its simplified
profile is obtained by:


1. substituting each occurrence of attribute $a$ in $R^{vp}$, $R^{ve}$, $R^{ip}$, $R^{ie}$, and $R^{\simeq}$ with $\omega(a,R^{\rightsquigarrow})$;
2. inserting $A$ into $R^{\simeq}$ for each pair $[a,A]$ in $R^{\rightsquigarrow}$.


The first step simply substitutes each derived attribute in the relation profile with the set of 
attributes from which it has been derived. The second step instead includes additional equiva- 
lences among attributes in the R~ component. This is needed to impose that a subject can access 
an attribute a derived from a set A of other attributes only when it has the same visibility on all 
attributes in A (more details on this uniform visibility requirement will be illustrated in the
remainder of this section). For instance, assume that attribute $a_{ij}$ is defined as the sum of
attributes $a_i$ and $a_j$. The visibility of a subject S over $a_{ij}$ should be the minimum between her
visibility over $a_i$ and her visibility over $a_j$ (i.e., if S can access $a_i$ in plaintext and $a_j$
encrypted, she cannot access their sum in plaintext, as she could reconstruct $a_j$). However, we also
require the same visibility over all
attributes involved in a computation because, for example, even the encrypted visibility over the 
result of a computation, combined with the plaintext visibility of one of its operands, may cause 
improper information leakage. 
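
The two steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the dictionary layout of a profile (keys vp, ve, ip, ie, eq, derived) and the names base_attrs, expand, and simplify are hypothetical and introduced here for the example. Derived attributes are assumed to be recorded as a mapping from each derived attribute a to the set A from which it was computed, with no cyclic definitions. The sets A inserted in the equivalence component are also expanded, so that the result mentions base attributes only.

```python
def base_attrs(a, derived):
    """Base attributes from which a was (directly or indirectly) computed."""
    if a not in derived:          # a is already a base attribute
        return {a}
    out = set()
    for x in derived[a]:
        out |= base_attrs(x, derived)
    return out


def expand(attrs, derived):
    """Replace every derived attribute in attrs with its base attributes."""
    out = set()
    for a in attrs:
        out |= base_attrs(a, derived)
    return out


def simplify(profile):
    """Return a five-component profile defined over base attributes only."""
    d = profile["derived"]
    simp = {k: expand(profile[k], d) for k in ("vp", "ve", "ip", "ie")}
    # Step 1 applied to the equivalence sets.
    eq = [frozenset(expand(A, d)) for A in profile["eq"]]
    # Step 2: one equivalence set per pair [a, A] of the derived component.
    eq += [frozenset(expand(A, d)) for A in d.values()]
    simp["eq"] = eq
    return simp


# Example from the text: a_ij is the sum of a_i and a_j and is visible in plaintext.
profile = {"vp": {"a_ij"}, "ve": set(), "ip": set(), "ie": set(),
           "eq": [], "derived": {"a_ij": {"a_i", "a_j"}}}
simplified = simplify(profile)
```

In the example, the simplified profile lists a_i and a_j as visible plaintext attributes and contains the equivalence set {a_i, a_j}, which is what later forces uniform visibility over the two operands.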

Formally, the simplified profile of a relation is defined as follows. 
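Definition 1.5.1 (Simplified Relation Profile) Let R be a relation with profile $[R^{vp}, R^{ve}, R^{ip}, R^{ie}, R^{\simeq}, R^{\rightsquigarrow}]$.
The simplified profile of R is a 5-tuple of the form $[R_s^{vp}, R_s^{ve}, R_s^{ip}, R_s^{ie}, R_s^{\simeq}]$ where
$R_s^{vp} = \omega(R^{vp}, R^{\rightsquigarrow})$, $R_s^{ve} = \omega(R^{ve}, R^{\rightsquigarrow})$, $R_s^{ip} = \omega(R^{ip}, R^{\rightsquigarrow})$,
$R_s^{ie} = \omega(R^{ie}, R^{\rightsquigarrow})$, and $R_s^{\simeq} = \omega(R^{\simeq}, R^{\rightsquigarrow}) \cup \{A \mid [a, A] \in R^{\rightsquigarrow}\}$.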


The simplified profile of a relation is equivalent to its relation profile (meaning that it conveys 
the same informative content), since it replaces the names of derived attributes with the names of 
the base attributes on which they depend. Note that, given a query plan, the simplification of the
profile of the relation $R_x$ resulting from the evaluation of a node $n_x$ can be performed only after
the evaluation of the operator represented by the parent of node $n_x$ (i.e., the profile of the parent of
$n_x$ should be computed from the original, non-simplified profile), because such an operator might operate
over the derived attributes. For instance, consider the relation profiles resulting from the aggregate
functions in Figure 1.6: the aggregate avg(P) (regardless of P being plaintext or encrypted) is
renamed to K, and hence possible subsequent operations are expected to be defined over the newly
introduced K.

Having defined simplified relation profiles, including only attributes over which authorizations 
have been defined, we can verify the satisfaction of the authorization policy by directly comparing 
simplified profiles and authorizations. For the sake of readability, in the following we denote with 
notation $\mathcal{P}_S$ ($\mathcal{E}_S$, resp.) the set of attributes that S can access in plaintext (in encrypted form,
resp.). To verify the satisfaction of the authorization policy in the release of a relation, we define 
the concept of authorized relation as follows. 


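Definition 1.5.2 (Authorized Relation) Let R be a relation with simplified profile
$[R_s^{vp}, R_s^{ve}, R_s^{ip}, R_s^{ie}, R_s^{\simeq}]$. A subject S is authorized for R iff:

1. $R_s^{vp} \cup R_s^{ip} \subseteq \mathcal{P}_S$ (authorized for plaintext);

2. $R_s^{ve} \cup R_s^{ie} \subseteq \mathcal{P}_S \cup \mathcal{E}_S$ (authorized for encrypted);

3. $\forall A \in R_s^{\simeq}$, $A \subseteq \mathcal{P}_S$ or $A \subseteq \mathcal{E}_S$ (uniform visibility).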


According to Definition 1.5.2, a subject S is authorized to access a relation R iff: 1) S is
authorized to access in plaintext all the (visible or implicit) attributes represented in plaintext in R;
2) S is authorized to access in plaintext or in encrypted form all the (visible or implicit) attributes
represented in encrypted form in R; 3) S is authorized to access in the same form (either plaintext
or encrypted) all the equivalent attributes, that is, attributes that appear together in an
equivalence set in $R^{\simeq}$ (uniform visibility). Note that the last condition enforces a control on the indirect
information flows caused by equivalence relationships produced in the evaluation of the query 
computation, to prevent unauthorized exposure of information. Clearly, the condition requires that 
a subject S be authorized to access all equivalent attributes. The condition also imposes that S 
enjoys the same (plaintext or encrypted) visibility over all attributes in an equivalence set, for all 
sets. This prevents the information leakage that may flow to a subject with different visibilities 
allowed over the attributes in a set: to illustrate, consider a relation with profile
$[a_1, a_2, \_, \_, \{a_1, a_2\}]$, and a subject S allowed to access $a_1$ in plaintext and $a_2$ encrypted.
If S were to be given access to the relation, she might leverage her plaintext visibility over $a_1$ to
determine the (plaintext) values of $a_2$, violating the authorizations. Requiring uniform visibility
prevents such an inference channel.
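
The three conditions of Definition 1.5.2 translate almost literally into set operations. The following Python sketch is illustrative only: the dictionary layout of a simplified profile and the function name is_authorized are assumptions made for this example, and the two sets plain and enc stand for the attributes that a subject can access in plaintext and in encrypted form, respectively.

```python
def is_authorized(profile, plain, enc):
    """Check the three conditions of Definition 1.5.2 on a simplified profile."""
    # 1. plaintext attributes (visible or implicit) need plaintext authorization
    ok_plain = (profile["vp"] | profile["ip"]) <= plain
    # 2. encrypted attributes need plaintext or encrypted authorization
    ok_enc = (profile["ve"] | profile["ie"]) <= (plain | enc)
    # 3. uniform visibility over every equivalence set
    ok_uniform = all(A <= plain or A <= enc for A in profile["eq"])
    return ok_plain and ok_enc and ok_uniform


# Example from the text: a1 visible in plaintext, a2 visible encrypted, a1 and a2 equivalent.
relation = {"vp": {"a1"}, "ve": {"a2"}, "ip": set(), "ie": set(),
            "eq": [frozenset({"a1", "a2"})]}

mixed = is_authorized(relation, plain={"a1"}, enc={"a2"})        # different visibilities
uniform = is_authorized(relation, plain={"a1", "a2"}, enc=set()) # same visibility on both
```

The subject with plaintext visibility over a1 and encrypted visibility over a2 passes the first two conditions but fails the third, while the subject with plaintext visibility over both attributes is authorized.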

Definition 1.5.2 illustrates the conditions that must be satisfied by a subject to access a relation.
Our problem, however, concerns assigning to subjects the execution of the operations in a query tree
plan. Each operation takes as input a relation (or two relations, in the case of join, cartesian product,
and set operators), and produces a relation. A subject is authorized to perform an operation if it is
authorized to access both its operands and its result. Indeed, the profiles of the input and output
relations of an operation capture all the information flows caused by the evaluation of the operation.


1.5.2 Assigning operations to subjects 


While original query plans assume all the attributes to be presented in plaintext and do not in- 
clude encryption/decryption operations, our authorization model distinguishes between two possi- 
ble views over the data: plaintext and encrypted. We then leverage on-the-fly encryption to adjust 
the visibility over the attributes to allow subjects authorized for encrypted visibility only to partic- 
ipate in query evaluation. Indeed, on-the-fly encryption can be used to hide plaintext values of an 
attribute when the subject involved in a computation over it can access it in encrypted form only.
Similarly, on-the-fly decryption can be adopted whenever dictated by operation execution. Indeed, 
some operations cannot be evaluated over encrypted data. Also, set operators specifically require 
that corresponding attributes in the operand relations be represented in the same form to enable 
the evaluation of the operation. The evaluation of intersection and difference requires comparing
the tuples in the two operands to compute the result. Even if the evaluation of the union operator does
not require such a comparison, it would still not be possible to operate on the resulting relation
where different tuples have a different representation form for the same attribute.

We enrich query plans with encryption and decryption operations, so as to adjust attribute
visibility as demanded by authorization enforcement and operation execution. We refer to a query
plan enriched with on-the-fly encryption and decryption operations as an extended query plan. To
illustrate, consider the two relations Hospital and Insurance illustrated in Section 1.2 and the
following query: “SELECT T, avg(P) FROM HOSP JOIN INS ON S=C WHERE D=‘stroke’ GROUP BY
T HAVING avg(P)>100” retrieving, for each treatment given to patients hospitalized for stroke,
the average insurance premium (if greater than USD 100). Figure 1.1(a) illustrates an example of
an extended plan for this query. Encryption and decryption operations are denoted with grey and
white boxes, respectively, surrounding the encrypted/decrypted attributes. The plan in the figure
includes two encryption operations (over S, and over C and P, before the computation of the join) and
one decryption operation (over P before the computation of the last selection $\sigma_{avg(P)>100}$). Nodes
in the figure are also enriched with the simplified profiles resulting from their operations, and with
the subjects in charge of their execution (reported on the left-hand side of each node).

Given a query plan, there can exist several extended query plans and, for each extended query 
plan, different assignments of subjects to operations that satisfy authorizations. An extreme strat- 
egy consists in keeping all attributes plaintext, but such a solution limits the number of subjects 
to which each operation can be assigned. On the other hand, keeping all the attributes encrypted 
would prevent the evaluation of operations requiring plaintext visibility over attributes. To determine
an extended query plan that enables the identification of subjects that can perform operations
without violating authorizations, we associate each node in the query plan with its minimum
required view. The minimum required view associated with a node represents the profile of the
node obtained assuming that all the attributes are encrypted with the exception of those needed
in plaintext for the evaluation of the operation. Intuitively, minimum required views characterize
operation requirements that can limit or prevent the adoption of encryption. All subjects that are
authorized for the minimum required view associated with a node can be considered candidates for
executing the operation (i.e., they can be assigned the evaluation of the operation in compliance with
authorizations). Operations can be assigned to any candidate, since visibility over the attributes can
always be adjusted by inserting on-the-fly encryption operations as demanded by authorizations,
without affecting the evaluation of the operation. For instance, with reference to the extended
plan in Figure 1.1, the execution of the join operation can be assigned to subject X thanks to the
fact that attributes S, C, and P (over which X is only authorized for the encrypted visibility) are
encrypted before the join execution.
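
The selection of candidates can be sketched as follows, assuming that a subject is a candidate when it is authorized (in the sense of Definition 1.5.2) for the views of both the operands and the result of the operation. All concrete data in this Python sketch is hypothetical: the helper names view and candidates, the profiles of the join operands and result, and the authorizations of subjects X, Y, and U are made up for illustration and are not taken from the figure.

```python
def is_authorized(profile, plain, enc):
    """Conditions of Definition 1.5.2 on a simplified profile."""
    ok_plain = (profile["vp"] | profile["ip"]) <= plain
    ok_enc = (profile["ve"] | profile["ie"]) <= (plain | enc)
    ok_uniform = all(A <= plain or A <= enc for A in profile["eq"])
    return ok_plain and ok_enc and ok_uniform


def view(vp=(), ve=(), ip=(), ie=(), eq=()):
    """Build a simplified profile from iterables of attribute names."""
    return {"vp": set(vp), "ve": set(ve), "ip": set(ip), "ie": set(ie),
            "eq": [frozenset(A) for A in eq]}


def candidates(operand_views, result_view, subjects):
    """Subjects authorized for all operand views and for the result view."""
    views = list(operand_views) + [result_view]
    return sorted(name for name, (plain, enc) in subjects.items()
                  if all(is_authorized(v, plain, enc) for v in views))


# Hypothetical views for the join on S=C, with S, C, and P encrypted beforehand.
left = view(vp=["T"], ve=["S"], ip=["D"])
right = view(ve=["C", "P"])
result = view(vp=["T"], ve=["S", "C", "P"], ip=["D"], eq=[["S", "C"]])

# Hypothetical authorizations: (plaintext attributes, encrypted attributes).
subjects = {
    "X": ({"T", "D"}, {"S", "C", "P"}),
    "Y": ({"T"}, {"S"}),
    "U": ({"S", "C", "P", "T", "D"}, set()),
}
join_candidates = candidates([left, right], result, subjects)
```

With these made-up authorizations, X becomes a candidate only because S, C, and P reach the join in encrypted form; Y is excluded because it lacks any visibility over D, C, and P.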

Several strategies can be adopted for choosing the most suitable candidate for each operation 
in a query plan. If the cost of encryption can be considered negligible, any query optimizer can be 
used for selecting the best candidate. Encryption and decryption are then inserted when needed 
to regulate visibility and to allow operation execution. When instead the impact of encryption 
and decryption operations is expected to be non-negligible (for instance, when adopting advanced 
encryption schemes that are computationally intensive and which could significantly impact the 
size of the input data), those costs should be taken into consideration with the aim of finding an
assignment of operations to candidates that minimizes the overall execution costs. This latter
strategy is currently under investigation, with the aim of defining appropriate cost functions able to
consider encryption and decryption costs in the computation of the most suitable candidate for
each operation in the considered query plan.


1.6 Summary 


This chapter focused on the problem of collaborative computations for analyzing data from differ- 
ent sources in the data market, assuming different trust by the data owners in the different subjects 
involved in the computation. The approach illustrated in this chapter enables the data market to 
rely on different providers for delegating the analysis and combination of data stored in the
market, possibly owned by different authorities. The proposed solution leverages the possibility of
expressing authorizations regulating the (plaintext or encrypted) visibility to enable collaborative
query evaluation.

MOSAICrOWN is now studying a novel solution for minimizing the economic cost of collab- 
orative query evaluation, taking into consideration the overhead caused by encryption and decryp- 
tion operations. 
In this chapter, we discuss the problems of storing and managing resources in the data market,
providing availability and security guarantees in scenarios where the data market relies on De- 
centralized Cloud Storage (DCS) platforms. DCS services represent a promising opportunity for 
the data market for storing resources. DCS services rely on the availability of multiple nodes that 
can be used to store resources in a decentralized manner. In such services, individual resources 
are fragmented in slices allocated (with replication to provide availability guarantees) to different 
nodes. The main characteristic of a DCS is the cooperative and dynamic structure formed by
independent nodes (providing a multi-authority storage network) that can join the service and offer
storage space, typically in exchange for some reward. The use of a dynamic network of unknown
(and potentially non-trusted) peers clearly introduces the problem of guaranteeing proper protec- 
tion to resources, ensuring their availability and security (for both confidentiality and integrity). 
In fact, a DCS is a potentially unstable network, and hence continuous participation of every sin- 
gle node cannot be assumed. Given this, and in line with the distributed nature of DCS services, 
resources are sliced into many slices, with different slices allocated to different nodes of the net- 
work, with replication to guarantee availability. Reconstruction of a resource requires collecting 
the different slices composing it. Also, nodes participating in the DCS - which can dynamically 
join and leave and are anonymous - cannot be considered trusted, hence resource confidentiality 
needs to be protected against each of them (as well as against possible coalitions of malicious 
nodes) and the data market should be able to assess integrity of the slices (and hence resources). 

In this chapter, we present a solution to enable data markets to securely store the resources they
manage in DCS services. While client-side encryption provides a first crucial layer of protection
in DCS, it leaves resources exposed to threats, especially in the long term. For instance, resources
are still vulnerable in case the encryption key is exposed (e.g., as a consequence of access
revocation), or in case of malicious nodes not deleting their slices upon request for resource deletion. For
this reason, we leverage the protection guarantees offered by All-Or-Nothing-Transform (AONT).
We devise an approach to carefully control resource slicing and allocation to nodes in the DCS 
network, with the goal of ensuring both availability and security. We rely on replication and on 
erasure codes to guarantee availability, that is, the ability for authorized users to retrieve all the 
slices to reconstruct the resource. We rely on AONT to guarantee security, that is, to protect against 
malicious parties jointly collecting their slices to prevent resource deletion or attempt decryption 
in case of key exposure. In this chapter, we use the term slicing to refer to the cutting of a resource 
and the term slices to refer to the result of such a process. A slice is therefore a chunk of the 
resource and represents a unit of allocation. 

The remainder of this chapter is organized as follows. Section 2.1 presents related works and
the innovation of MOSAICrOWN. Section 2.2 introduces the basic concepts. Section 2.3 defines
the properties of a decentralized allocation function with respect to replication and protection.
Section 2.4 discusses slicing and allocation strategies. Section 2.5 illustrates availability and
security guarantees and discusses the setting of parameters guiding slicing and allocation. Section 2.6
presents the combined adoption of AONT and error-correcting code techniques to provide
better availability guarantees. Section 2.7 analyzes the availability and security guarantees obtained
when combining AONT and fountain codes. Section 2.8 concludes the chapter.


2.1 State of the art and MOSAICrOWN innovation 


In this section, we illustrate the state of the art and the innovation produced by MOSAICrOWN 
for the storage of resources managed in the data market, relying on DCS services. 


2.1.1 State of the art 


RAID [PGK88] is one of the main contributions aimed at the construction of reliable systems.
RAID is normally deployed on local drives. With the advent of the cloud, RAID has been extended 
to take adversarial failures into consideration. Along this line of works, HAIL (High-Availability 
and Integrity Layer) extended RAID with multiple cloud storage providers and a Proof 
of Retrievability (POR) scheme to verify that a provider still holds a certain piece of 
information. HAIL is however not well-suited for DCS systems. Also, HAIL does not take into 
account the possibility of adversarial users trying to reconstruct resources for their personal profit. 

Many DCS networks that have recently been proposed already include a certain degree of security
guarantees (i.e., protection against malicious parties jointly collecting all the slices composing
a resource). Among them, Storj and Sia adopt client-side encryption and do
not protect the outsourced data against coalitions of malicious nodes. SAFE Network
instead adopts a self-encryption technique: the resource is divided into shards and a weak AONT
among three shards is applied before uploading them. The design of the SAFE Network
and the possible attack vectors have been analyzed in prior work; the proposed solution is,
however, predetermined, and the interaction between redundancy and security is not analyzed.

Another related line of work is the security of outsourced data (e.g., [TPPG13]), which can be
improved using AONT. Existing solutions, however, consider domains
different from DCS.

A precursor of DCS is represented by P2P systems. The P2P system closest to our proposal,
which considers reliability and security, is Tangler [WM01]. The goal of Tangler is censorship
resistance, which is a potential application of DCS, but not its main goal. A characteristic of 
Tangler is the use of Shamir’s method, which is quite expensive in terms of storage and bandwidth. 
Also, it does not aim at combining availability and confidentiality requirements in data allocation. 

The combined adoption of AONT and error-correcting code techniques has been recently
explored with the aim of protecting outsourced data against possibly curious storage providers and
offering high performance (e.g., [BMMC14, RP11]). The proposal AONT-RS combines
Rivest’s AONT with Reed-Solomon, while AONT-LT uses Luby Transform
codes (which are a class of fountain codes [CSGM06, Sho11]) instead of Reed-Solomon. These
proposals are specifically focused on static scenarios, where the set of nodes is fixed and nodes
are not expected to frequently leave/join the network. These solutions are therefore not suited to DCS
scenarios.


2.1.2 MOSAICrOWN innovation 


MOSAICrOWN produced several advancements over the state of the art, which are discussed in 
this section. 
Figure 2.1: Reference scenario


• The first innovation is represented by a technique enabling the data market to control
resource slicing and allocation to nodes in a DCS, used for data storage, in such a way as to
guarantee both availability and security. The proposed solution leverages the guarantees
offered by AONT, combined with resource slicing and slice replication.


• The second innovation is represented by the definition of different strategies for slicing and
distributing resources across the decentralized network, and the analysis of their
characteristics in terms of availability and security guarantees. The modeling of the slicing and
allocation problem enables the data market to control the granularity of slicing and the
diversification of allocation to ensure the desired availability and security guarantees.


• The third innovation is represented by the combined adoption of AONT and fountain codes
for better security and availability guarantees, while reducing the performance overhead of
current solutions, in scenarios where the nodes composing the DCS are not stable and can
dynamically join and leave the network.


The results obtained by MOSAICrOWN and illustrated in this chapter have been published
in [BDF+19b, BDF+20]. A preliminary version of the AONT-based approach illustrated in this
chapter has been implemented by one of the tools presented in deliverable D4.1 “First version of 
encryption-based protection tools” [FL20]. 


2.2 Preliminaries 


The two basic building blocks enabling the development of our solution are the adoption, at the 
client side, of All-or-Nothing-Transform (AONT) and of fountain codes. Figure 2.1 illustrates our
reference scenario. The focus of this chapter is the design of proper slicing of resources and the 
allocation of the produced slices to different nodes in the DCS system. The chapter will also 
illustrate the use of fountain codes, in combination with AONT, to provide availability guarantees 
in case of node failure. 


2.2.1 AONT 


AONT is an encryption mode that requires the use of an encryption key. The encryption driven by 
the key represents the primary protection, and the use of AONT encryption mode further strength- 
ens security. An AONT-encryption mode transforms a plaintext resource into a ciphertext, with 
the property that the whole result of the transformation is required to obtain back the original 
plaintext. Indeed, the absence of even a single block from the ciphertext prevents decryption of any
other block. AONT guarantees in fact complete interdependence (mixing) among the bits of the 
encrypted resource in such a way that the unavailability of a portion of the encrypted resource 
prevents the reconstruction of any portion of the original plaintext. A party having access to a 
portion of the encrypted resource (but not to the encrypted resource in its entirety): 


• if it knows the encryption key, it will not be able to reconstruct any portion of the resource
(i.e., it will not be able to derive any information from the AONT-encrypted portions it has;
the only option would be to attempt a brute-force attack on the possible configurations of
the missing portions, but their possibly large size makes this attack infeasible);


• if it does not know the encryption key, it will not be able to perform brute-force attacks for
guessing such a key, as any key (even the correct one) will be ineffective if not applied to
the complete resource.


AONT protection schemes can be built with the use of common
cryptographic functions, like symmetric encryption and hash functions.


An example of an AONT scheme that guarantees complete mixing, which is at the basis of the 
solutions presented in this chapter, is Mix&Slice [BDF+16].

The basic building block of Mix&Slice is the application of a symmetric block cipher operating 
on blocks, which guarantees complete dependency of the encrypted result from every bit of the 
input and the impossibility, when missing some bits of an encrypted version of a block, to retrieve 
the original plaintext block (even if parts of it are known). The larger the number of bits that are 
missing, the harder the effort required to perform a brute-force attack, which requires attempting 
$2^x$ possible combinations of values when x bits are missing. Such a security parameter is at the
center of our approach and we explicitly identify a sequence of bits of its length as the atomic 
unit on which our approach operates, which we call mini-block. Applying block encryption with 
explicit consideration of such atomic unit of protection, and extending it to a coarser-grain with 
iterative rounds, our approach identifies the following basic concepts. 


• Block: a sequence of bits input to a block cipher.


• Mini-block: a sequence of bits, of a specified length (divisor of the size of the block),
contained in a block. It represents our atomic unit of protection.


• Macro-block: a sequence of blocks. It allows extending the application of a block cipher to
sequences of bits larger than individual blocks.


The basic step of Mix&Slice (on which it iteratively builds to provide complete mixing within 
a macro-block) is the application of encryption at the block level. This application is visible at the 
top of Figure 2.2, where the first row reports a sequence of 16 mini-blocks ([0], ..., [15]) composing
4 blocks. The second row is the result of block encryption on the sequence of mini-blocks. As 
illustrated by the pattern-coding in the figure, encryption provides mixing within each block so 
that each mini-block in the result is dependent on every mini-block in the same input block. One 
round of block encryption provides mixing only at the level of block. 

The idea of Mix&Slice is to extend mixing to the whole macro-block by the iterative applica- 
tion of block encryption on, at each round, blocks composed of mini-blocks that are representative 
of different encryptions in the previous round (i.e., mini-blocks that belong to the block resulting 
from the encryption of a block in the previous encryption round). For instance, with reference 
to Figure 2.2, where $[0]_1, \ldots, [15]_1$ are the mini-blocks resulting from the first round, the second
Figure 2.2: An example of mixing of 16 mini-blocks assuming m = 4


round would apply again block encryption, considering different blocks each composed of a repre- 
sentative of a different computation in the first round (i.e., one mini-block among the first four, one 
among the second four, one among the third four, and one among the last four). To guarantee such 
a composition, Mix&Slice defines the blocks input to the four encryption operations as composed 
of mini-blocks that are at distance 4 in the sequence, which corresponds to saying that they resulted
from different encryption operations in the previous round. The blocks considered for encryption
would then be $\langle [0]_1 [4]_1 [8]_1 [12]_1 \rangle$, $\langle [1]_1 [5]_1 [9]_1 [13]_1 \rangle$,
$\langle [2]_1 [6]_1 [10]_1 [14]_1 \rangle$, and $\langle [3]_1 [7]_1 [11]_1 [15]_1 \rangle$. The
result would be a sequence of 16 mini-blocks, each of which is dependent on each of the 16 original
mini-blocks, that is, the result provides mixing among all 16 mini-blocks, as visible from the
pattern-coding in the figure. With 16 mini-blocks, two rounds of encryption suffice for guaranteeing
mixing among all of them. Providing mixing for larger sequences clearly requires more
rounds. More precisely, at each round i, mini-blocks are mixed among chunks of $m^i$ mini-blocks
(with m the number of mini-blocks in a block, 4 in our example), hence ensuring, at round i, mixing
of a macro-block composed of $m^i$ mini-blocks.

An important feature of the mixing is that the number of bits that are passed from each block 
in a round to each block in the next round is equal to the size of the mini-block. This guarantees 
that the uncertainty introduced by the absence of a mini-block at the first round maps to the same 
level of uncertainty for each of the blocks involved in the second round, and iteratively to the next 
rounds, thanks to the use of AES at each iteration. This implies that a complete mixing of the
macro-block requires at least $\log_m(m \cdot b)$ rounds (with b the number of blocks in the macro-block), which is the number of rounds required by our technique.
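
The round structure can be checked with a small Python simulation that tracks dependencies only. This is a sketch under simplifying assumptions and not the Mix&Slice implementation: no encryption is performed, block encryption is modeled as "every output mini-block depends on every input mini-block of the same block", and the function name mix_dependencies is hypothetical. The number n of mini-blocks in the macro-block is assumed to be a power of m.

```python
def mix_dependencies(n, m):
    """Simulate mixing rounds over n mini-blocks, m mini-blocks per block.

    Returns the number of rounds performed and, for each mini-block position,
    the set of original mini-blocks on which its final content depends.
    """
    deps = [{i} for i in range(n)]   # initially each mini-block depends on itself
    rounds = 0
    stride = 1                       # distance between mini-blocks of the same block
    while stride < n:
        span = stride * m            # size of the chunk mixed at this round
        new = list(deps)
        for start in range(0, n, span):
            for off in range(stride):
                # positions of the m mini-blocks forming one block in this round
                idx = [start + off + k * stride for k in range(m)]
                union = set().union(*(deps[j] for j in idx))
                for j in idx:
                    new[j] = union   # block encryption mixes all its inputs
        deps = new
        stride = span
        rounds += 1
    return rounds, deps


rounds_16, deps_16 = mix_dependencies(16, 4)
rounds_64, deps_64 = mix_dependencies(64, 4)
```

For the example in the text (16 mini-blocks, m = 4), the second round groups positions 0, 4, 8, 12 and so on, and every position ends up depending on all 16 original mini-blocks after two rounds; with 64 mini-blocks a third round is needed.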

Another crucial aspect is that the representation of the resource after each round (i.e., the 
output of each round) has to be of the same size as the original macro-block. In fact, if the 
transformation produced a more compact representation, there would be a possibility for a user
to store this compact representation and maintain access to the resource even after revocation,
weakening the security guarantees provided by the encryption mode.

When resources are extremely large (or when access to a resource involves only a portion 
of it), considering a whole resource as a single macro-block may not be desirable. Even if only
with a logarithmic dependence, the larger the macro-block the more the encryption (and therefore 
decryption to retrieve the plaintext) rounds required. Also, encrypting the whole resource as a 
single macro-block implies its complete download at every access, when this might actually not 
be needed for service. Accounting for this, we do not assume a resource to correspond to an 
individual macro-block, but assume instead that any resource can be partitioned into M macro- 
Figure 2.3: From resource to fragments


blocks, which can then be mixed independently. Encryption of a resource would then entail a
preliminary step cutting the resource into different, equally sized, macro-blocks on which mixing
operates. To ensure that the mixed versions of macro-blocks are all different, even if with the same
original content, the first block of every macro-block is XORed with an initialization vector (IV).

The starting point for introducing mixing is to ensure that each single bit in the encrypted 
version of a macro-block depends on every other bit of its plaintext representation, and therefore 
that removing any one of the bits of the encrypted macro-block would make it impossible (apart 
from brute-force attacks) to reconstruct any portion of the plaintext macro-block. Such a property 
operates at the level of macro-block. Hence, if a resource (because of size or need of efficient 
fine-grained access) has been partitioned into different macro-blocks, removal of a mini-block 
would only guarantee protection of the macro-block to which it belongs, while not preventing 
reconstruction of the other macro-blocks (and therefore partial reconstructions of the resource). 
Resource protection can be achieved if, for each macro-block of which the resource is composed, 
a mini-block is removed. Slicing the encrypted resource consists in defining different fragments 
such that a fragment contains a mini-block for each macro-block of the resource, no two fragments 
contain the same mini-block, and for every mini-block there is a fragment that contains it. To 
ensure all this, as well as to simplify management, Mix&Slice slices the resource by simply putting
in the same fragment the mini-blocks that occur at the same position in the different macro-blocks.
Figure 2.3(b) illustrates the slicing process.
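
The positional slicing rule amounts to a transposition, as in the following Python sketch. The representation of a mixed resource as a list of macro-blocks, each a list of mini-blocks, and the function name slice_into_fragments are assumptions made for this illustration; mini-blocks are shown as labels rather than actual ciphertext bits.

```python
def slice_into_fragments(macro_blocks):
    """Fragment j collects the j-th mini-block of every macro-block."""
    n = len(macro_blocks[0])
    if any(len(mb) != n for mb in macro_blocks):
        raise ValueError("macro-blocks must contain the same number of mini-blocks")
    return [[mb[pos] for mb in macro_blocks] for pos in range(n)]


# Three macro-blocks of four mini-blocks each (labels stand for mixed mini-blocks).
resource = [[f"M{i}[{j}]" for j in range(4)] for i in range(3)]
fragments = slice_into_fragments(resource)
```

Each fragment holds exactly one mini-block per macro-block, no mini-block appears in two fragments, and every mini-block appears in some fragment, so withholding any single fragment removes a mini-block from every macro-block.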


2.2.2 Fountain codes 


Fountain codes are a class of erasure codes, which prevent the loss of some of the transmitted or
stored blocks of a resource from causing a data loss. Given a resource r, partitioned into f different
fragments, an erasure code generates a set of s > f encoded slices that depend on the resource
content and support the reconstruction of r through the combination of a subset of the encoded
slices. Fountain codes, unlike other erasure codes (e.g., Reed-Solomon [RS60]), offer probabilistic
reconstruction guarantees, meaning that, with a probability p < 1, f of the s slices are sufficient for
reconstructing r. The reconstruction probability p exponentially increases by retrieving additional
slices. Although probabilistic, fountain codes have two main characteristics that, as we will discuss
in the next section, allow us to profitably use them in the DCS context. First, they are rateless, that
is, using these codes it is possible to create new (i.e., different from each other) slices on the fly,
and therefore the number s of encoded slices is not fixed a priori. Independently of the number
s of slices, any subset of (at least) f slices can be used to reconstruct r. Second, each slice depends
on a subset of (and not on all) the f original fragments of the resource, and hence only a subset of
the original fragments is needed for generating a new slice.
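
The rateless and probabilistic behavior can be illustrated with a toy XOR-based code in Python. This is a deliberately simplified sketch and not an LT code: subsets of fragments are chosen uniformly at random (real fountain codes use a carefully designed degree distribution) and decoding uses Gaussian elimination over GF(2) rather than the usual peeling decoder. The names xor, make_slice, and decode are hypothetical.

```python
import random


def xor(a, b):
    """Bytewise XOR of two equally long byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_slice(fragments, rng):
    """Create one encoded slice: XOR of a random non-empty subset of fragments."""
    f = len(fragments)
    mask = rng.randrange(1, 1 << f)          # bit j set: fragment j is included
    payload = bytes(len(fragments[0]))
    for j in range(f):
        if (mask >> j) & 1:
            payload = xor(payload, fragments[j])
    return mask, payload


def decode(slices, f):
    """Return the f fragments, or None if the slices do not determine them yet."""
    basis = {}                               # highest set bit -> (mask, payload)
    for mask, payload in slices:
        while mask:
            pivot = mask.bit_length() - 1
            if pivot not in basis:
                basis[pivot] = (mask, payload)
                break
            bmask, bpayload = basis[pivot]
            mask ^= bmask                    # eliminate the pivot bit
            payload = xor(payload, bpayload)
    if len(basis) < f:
        return None
    frags = [None] * f
    for pivot in range(f):                   # back-substitution, lowest pivot first
        mask, payload = basis[pivot]
        for j in range(pivot):
            if (mask >> j) & 1:
                payload = xor(payload, frags[j])
        frags[pivot] = payload
    return frags


rng = random.Random(7)
fragments = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
received, decoded = [], None
# Rateless use: keep generating fresh slices until reconstruction succeeds.
while decoded is None and len(received) < 1000:
    received.append(make_slice(fragments, rng))
    decoded = decode(received, len(fragments))
```

At least f slices are always needed, and a few extra slices may be required when some of the received ones are linearly dependent, which mirrors the probabilistic reconstruction guarantee described above.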


2.3 Allocation properties 


In our approach, the slicing of the resources into several slices to be distributed at the different 
nodes of the data market is guided by the availability and protection properties that need to be 
guaranteed. Availability (despite node failures or temporary unreachability) is provided through
replication; security is provided through protection against malicious coalitions. Malicious nodes
(and coalitions thereof) are interested in making the resource unavailable, by not returning the
slices of the resource they store, or in providing access to a resource even after its deletion, by not
removing the slices of the resource they store and returning such slices to (unauthorized)
subjects who pay for them. Before addressing slicing, we then characterize the replication and coalition
resistance properties of the distribution of a resource. 

We assume a (transformed) resource that has undergone AONT encryption (as described in 
the previous section) at the client side. For simplicity, we will omit such an explicit remark on 
transformation, and we will simply use the term resource to denote an AONT-encrypted resource. 
Also, we assume a resource to be composed of different slices, for storage in a data market relying 
on a DCS. We will address the problem of producing such slices in Section 2.4.

We model a resource as a set $\mathcal{S} = \{s_1, \ldots, s_s\}$ of slices to be allocated to the nodes,
denoted $\mathcal{N}$, of the DCS. The following definition formalizes slice allocation.


Definition 2.3.1 (Allocation function) Let $\mathcal{S}$ be a set of slices composing a resource and
$\mathcal{N}$ be a set of nodes. An allocation function $\phi : \mathcal{S} \to 2^{\mathcal{N}} \setminus \{\emptyset\}$
assigns each slice $s_i \in \mathcal{S}$ to a set of nodes $\phi(s_i) = N_i \subseteq \mathcal{N}$, $N_i \neq \emptyset$.


The allocation function dictates how slices are allocated to nodes in the DCS. The use of
sets of nodes (in contrast to individual nodes) in the co-domain accommodates replication.
The exclusion of the empty set of nodes ensures lossless distribution (i.e., each slice is allocated
to at least one node). Figure 2.4 illustrates an example of an allocation function, considering a
resource split into ten slices ($\mathcal{S} = \{s_1, \ldots, s_{10}\}$) allocated to five nodes ($n_1, \ldots, n_5$) in the DCS
(nodes not used in the allocation are not reported in the figure). The figure has a row for each
node and a column for each slice. The allocation of a slice to a node is represented by a gray box
at the intersection between the row representing the node and the column representing the slice.
Empty boxes with a dotted frame represent the fact that the slice is not allocated to the node. For
example, $\phi(s_1) = \{n_i, n_j\}$ for the two nodes $n_i$, $n_j$ whose boxes in the column of $s_1$ are gray.

We identify two main properties of an allocation, characterizing the availability, provided by 
replication, and the protection against possible malicious coalitions of nodes, provided by the 
diversification of the allocation. 

Figure 2.4: An example of a minimal 3-protected and 2-replicated allocation function

We characterize availability provided by replication in terms of the number of replicas maintained
in the system. While in principle the number of replicas maintained for each slice can
differ, we assume the same number of replicas is used for all the slices. This derives from the
fact that we assume that nodes are not associated with individual reliability profiles (Section 2.5).
Since all slices are needed to reconstruct the resource, using fewer replicas for any of the slices
would decrease the availability of the resource, which will be dictated by such a lower bound. The
following definition formalizes the replication degree of an allocation function.


Definition 2.3.2 (r-Replicated allocation function) Let $\mathcal{S}$ be a set of slices composing a
resource, $\mathcal{N}$ be a set of nodes, and $\phi$ be an allocation function. Function $\phi$ is r-replicated iff
$\forall s_i \in \mathcal{S}$, $|\phi(s_i)| \geq r$.


For instance, the allocation function in Figure 2.4 is 2-replicated, as two copies are maintained
for each slice. 

We characterize the protection offered by an allocation in terms of the minimum number of 
nodes required to reconstruct a resource, as formalized by the following definition. 


Definition 2.3.3 (k-Protected allocation function) Let $\mathcal{S}$ be a set of slices composing a resource,
$\mathcal{N}$ be a set of nodes, and $\phi$ be an allocation function. Function $\phi$ is k-protected iff for each
$N_i \subseteq \mathcal{N}$, with $|N_i| \leq k$, $\exists s_j \in \mathcal{S}$ s.t. $\phi(s_j) \cap N_i = \emptyset$.


A k-protected allocation function distributes slices to nodes in such a way that no fewer than
k + 1 nodes must cooperate to collect all the slices composing the resource
(and hence to retrieve its plaintext). In other words, a k-protected allocation function
guarantees protection of the resource against malicious (i.e., colluding) behavior of up to k nodes.
In fact, with a k-protected allocation function, for each coalition of k nodes in $\mathcal{N}$, there is at least
one slice that is not stored at any of the nodes in the coalition. Hence, such a coalition can neither
decrypt the resource with a brute-force attack, nor prevent its deletion. The allocation function in
Figure 2.4 is 3-protected: any subset of 3 out of the 5 nodes misses at least one slice. For instance,
coalition $\{n_1, n_2, n_3\}$ misses slice $s_{10}$, while coalition $\{n_1, n_2, n_4\}$ misses slice $s_9$. On the contrary,
the allocation function in Figure 2.5, on the same slices and nodes, is not 3-protected (but only
2-protected): a coalition of three nodes that includes $n_3$ and $n_4$ jointly possesses all the slices.
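
Definitions 2.3.2 and 2.3.3 can be checked mechanically on small instances. The following minimal sketch assumes an allocation is represented as a dictionary from slice names to sets of node names; the function names and the three-node toy allocation are hypothetical and are not taken from the figures.

```python
from itertools import combinations

def is_r_replicated(alloc, r):
    """Definition 2.3.2: every slice is stored at r or more nodes."""
    return all(len(stored) >= r for stored in alloc.values())

def is_k_protected(alloc, nodes, k):
    """Definition 2.3.3: every coalition of at most k nodes misses a slice.

    Checking the largest coalitions is enough: a smaller coalition is
    contained in a larger one and therefore misses the same slice.
    """
    size = min(k, len(nodes))
    return all(
        any(stored.isdisjoint(coalition) for stored in alloc.values())
        for coalition in combinations(sorted(nodes), size)
    )

# Toy allocation: 3 slices, each replicated on 2 of 3 nodes.
nodes = {"n1", "n2", "n3"}
alloc = {"s1": {"n1", "n2"}, "s2": {"n1", "n3"}, "s3": {"n2", "n3"}}

print(is_r_replicated(alloc, 2))         # True
print(is_k_protected(alloc, nodes, 1))   # True: each node misses one slice
print(is_k_protected(alloc, nodes, 2))   # False: any two nodes hold all slices
```

The brute-force enumeration of coalitions is exponential in k and is meant only for reasoning on small examples.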

We refer to an allocation function that is r-replicated, according to Definition 2.3.2, and
k-protected, according to Definition 2.3.3, as a (k,r)-allocation.


Definition 2.3.4 ((k,r)-Allocation) Let $\mathcal{S}$ be a set of slices composing a resource, $\mathcal{N}$ be a set of
nodes, and $\phi$ be an allocation function. Function $\phi$ is a (k,r)-allocation iff it is k-protected and
r-replicated.


Figure 2.5: An example of 2-replicated allocation function that is not 3-protected

According to Definitions 2.3.2 and 2.3.3, a (k,r)-allocation is also a $(k',r')$-allocation, for any
$r' \leq r$ and any $k' \leq k$. In fact, trivially, an allocation function providing r replicas also provides
$r' \leq r$ replicas. Analogously, an allocation function protecting a resource from coalitions of k
nodes also protects the resource from coalitions of $k' \leq k$ nodes. Among all (k,r)-allocations, we
are interested in identifying those for which k and r represent the highest values satisfying the
availability and protection properties (i.e., satisfying the properties in a minimal way). We call
such allocation functions minimal, as formalized by the following definition.


Definition 2.3.5 (Minimal (k,r)-allocation) Let $\mathcal{S}$ be a set of slices composing a resource, $\mathcal{N}$
be a set of nodes, and $\phi$ be a (k,r)-allocation. Function $\phi$ is minimal iff:


1. it is not (k + 1)-protected;


2. $\forall s_i \in \mathcal{S}$, $|\phi(s_i)| = r$.


According to Definition 2.3.5, a minimal (k,r)-allocation is an allocation that guarantees
protection against coalitions of up to k (but no more) nodes and that uses exactly r replicas. The
allocation function in Figure 2.4 is an example of a minimal (3,2)-allocation. In the following, we will
restrict our attention to minimal allocation functions and, when talking about a (k,r)-allocation,
we will implicitly assume such minimality.


2.4 Slicing and allocation strategies 


In the absence of replication, producing an allocation that guarantees k-protection, that is, a (k, 1)- 
allocation, is straightforward: it is sufficient to split the resource into k + 1 slices and allocate 
each slice to a different node. When considering replication, different approaches can be taken 
for allocation, differing in the granularity of slicing and in how allocation diversifies the storage 
at different nodes. In the following, we discuss these options. In the discussion, in addition to 
parameters k and r introduced before, we will use parameters s, denoting the number of slices in 
which a resource is split, and n, denoting the number of nodes to be involved in the allocation of a 
resource. Different approaches vary in the number s of slices to be considered and in the number 
n of nodes to be involved for providing a (k, r)-allocation. We note that, with respect to nodes, the 
only parameter to be considered in the allocation strategies is the number n of nodes to be involved 
(the specific nodes to be involved can be selected randomly). We identify and study the behavior of 
two approaches for producing a (k,r)-allocation. The first approach aims to minimize the number 
of slices (Min_slices), while the second aims to minimize the number of nodes (Min_nodes). We 
analyze these two approaches as they represent the two extremes with respect to granularity of 
slicing and diversification of allocation. Their analysis highlights the characteristics
of fine-grained (Min_nodes) and coarse-grained (Min_slices) slicing, and can also represent a
reference for intermediate configurations.

Figure 2.6: An example of (3,2)-allocation that minimizes the number of slices


2.4.1 Minimizing the number of slices 


We start by noting that the number s of slices involved for guaranteeing a (k,r)-allocation must be
such that $s \geq k + 1$. In fact, there should be at least k + 1 slices to guarantee k-protection.

A simple approach for determining a (k,r)-allocation extends the natural approach of producing
k + 1 slices, by simply considering their replication at different nodes. Such an approach is
characterized by coarse slicing, since minimizing the number of slices clearly entails a larger
size for them, and by consistent replication (i.e., the sets of slices stored at two nodes are either
disjoint or identical).

We observe that a (k,r)-allocation function using the minimum number (s = k + 1) of slices 
implies that: 


1. a node maintains at most one slice; 


2. the number of nodes involved in the allocation is exactly r times the number of slices (i.e.,
$n = r \cdot (k + 1)$).


The first observation derives from the fact that, since there are only k+ 1 slices, placing more 
than one slice on a node would imply the existence of a set of k nodes able to reconstruct the re- 
source and therefore would not guarantee k-protection anymore. The second observation naturally 
derives from the first, considering that every slice needs to be replicated r times. 

As an example, a (3,2)-allocation using the minimum number of slices would imply splitting
the resource into 4 (= 3 + 1) slices and generating 2 copies of each slice, to be distributed at 8
different nodes. Figure 2.6 illustrates an example of an allocation function enforcing this.

A (k,r)-allocation that uses the minimum number of slices s = k + 1 resists failures well.
Indeed, k + 1 nodes out of $r \cdot (k + 1)$ are sufficient to reconstruct the resource content, as long as
one replica of each slice is available. However, the number of nodes used by such an allocation
function quickly grows with k and r. For instance, a (10,5)-allocation would need
55 (= $5 \cdot (10 + 1)$) nodes.
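
The Min_slices strategy can be sketched in a few lines. The function name and the slice and node labels below are hypothetical; the only facts taken from the text are that s = k + 1 slices are produced and that each slice is replicated on its own group of r nodes.

```python
def min_slices_allocation(k, r):
    """Return {slice: set_of_nodes} using k + 1 slices and r * (k + 1) nodes."""
    alloc = {}
    for i in range(k + 1):
        # Slice i gets its own group of r nodes; groups never overlap,
        # so every node stores exactly one slice.
        alloc[f"s{i + 1}"] = {f"n{i * r + j + 1}" for j in range(r)}
    return alloc

alloc = min_slices_allocation(3, 2)
nodes_used = set().union(*alloc.values())
print(len(alloc), len(nodes_used))                     # 4 slices on 8 nodes

nodes_10_5 = set().union(*min_slices_allocation(10, 5).values())
print(len(nodes_10_5))                                 # 55 nodes
```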


2.4.2 Minimizing the number of nodes 


At the other end of the spectrum of possible strategies for defining and distributing slices to
guarantee a (k,r)-allocation, there are functions minimizing the number of nodes to be involved in the
distribution (and deriving the number of slices into which the resource needs to be split based on
this).

A trivial lower bound on the number of nodes that need to be involved in a (k,r)-allocation
is $n \geq \max(k + 1, r)$, since there should be at least r nodes to hold r replicas and at least k + 1
nodes to guarantee k-protection. The minimum number of nodes to be involved to guarantee a
(k,r)-allocation is actually higher than that, as it needs to be at least the sum of the protection and
replication parameters (k and r).

This derives from two simple observations. First, to guarantee k-protection, for each coalition
of k nodes, there must exist at least one slice that is not stored at any of the nodes in the coalition.
Second, to provide r-replication, such a slice should be stored at (at least) r nodes that are not in
the coalition. As we will illustrate in the following, k + r nodes, besides being necessary, are also
sufficient to define a (k,r)-allocation.

While using the minimum number of slices applies a coarse slicing with consistent replication, 
using the minimum number of nodes applies a fine-grained slicing with diversified replication 
across nodes. Intuitively, instead of splitting the resource into slices and allocating to each node a 
single slice, minimizing the number of nodes requires slicing the resource into more fine-grained 
slices and allocating the slices to nodes in a diversified manner, to guarantee that no set of k nodes 
jointly possesses all the slices. Defining the allocation then requires identifying the number
of slices into which a resource needs to be split, which must be sufficient to distribute the r replicas
to nodes while ensuring k-protection. The minimum number of slices needed for ensuring that no
set of k nodes is able to reconstruct the resource when using k + r nodes is obtained when
any set of k nodes misses exactly one slice (which, given r-replication, would instead be stored at
the r nodes not belonging to the set) and no two coalitions miss the same slice. In fact, if two sets
of k nodes missed the same slice, such a slice could not have r replicas when using only k + r nodes.
The number of required slices can then be identified as the number of coalitions of k nodes out of
k + r, that is, $\binom{k+r}{k}$.

A (k,r)-allocation that uses k + r nodes and $\binom{k+r}{k}$ slices has two interesting properties. The
first one, already noted, is that any coalition of k nodes misses exactly one slice. The second one,
deriving from the fact that the missing slice is different for different coalitions, is that any set
of k + 1 nodes is sufficient to reconstruct the resource (differently from the Min_slices approach,
where at least k + 1 nodes are needed to reconstruct the resource but not every set of k + 1 nodes
guarantees that).

A (k,r)-allocation that minimizes the number of nodes can be obtained by assuming $\mathcal{N}$ to
comprise k + r nodes and proceeding as follows. Let $\mathcal{C}_k = \{N_i \in 2^{\mathcal{N}} : |N_i| = k\}$ be the set of all subsets
of k nodes in $\mathcal{N}$. For each slice $s_i \in \mathcal{S}$, $i = 1, \ldots, \binom{k+r}{k}$, $\phi(s_i) = \mathcal{N} \setminus N_i$ with $N_i \in \mathcal{C}_k$.
Intuitively, for each slice $s_i$, $\phi(s_i)$ selects a coalition of k nodes that misses $s_i$ and allocates slice
$s_i$ to all the other nodes. This guarantees that each coalition ($N_i$) of k nodes misses at least one
slice ($s_i$), providing k-protection. Slice $s_i$, which represents the missing slice for coalition $N_i$,
is stored at all the other $n - k = r$ nodes in $\mathcal{N}$, providing r-replication. Intuitively, in a
(k,r)-allocation using the minimum number of nodes, no two slices are allocated exactly to the same set
of nodes (i.e., $\forall s_i, s_j \in \mathcal{S}$ with $i \neq j$, $\phi(s_i) \neq \phi(s_j)$). In fact, the number of possible subsets of
r nodes in $\mathcal{N}$ is $\binom{k+r}{r}$, and $\binom{k+r}{r} = \binom{k+r}{k}$.
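
The construction above translates directly into code. In this minimal sketch the function name and the numbering of slices are our own assumptions (the numbering used in Figure 2.4 may differ); the logic follows the text: one slice per coalition of k nodes, stored at the r nodes outside that coalition.

```python
from itertools import combinations
from math import comb

def min_nodes_allocation(k, r):
    """Build a (k,r)-allocation over k + r nodes with C(k+r, k) slices."""
    nodes = [f"n{i}" for i in range(1, k + r + 1)]
    alloc = {}
    for idx, coalition in enumerate(combinations(nodes, k), start=1):
        # The slice associated with this coalition is stored at all
        # the other nodes, i.e., at the r nodes outside the coalition.
        alloc[f"s{idx}"] = set(nodes) - set(coalition)
    return nodes, alloc

nodes, alloc = min_nodes_allocation(3, 2)
print(len(nodes), len(alloc), comb(5, 3))              # 5 nodes, 10 slices

# Slices missed by each coalition of k = 3 nodes.
missed = {
    coalition: [s for s, stored in alloc.items() if stored.isdisjoint(coalition)]
    for coalition in combinations(nodes, 3)
}

# Slices held by each set of k + 1 = 4 nodes.
held_by_four = [
    {s for s, stored in alloc.items() if not stored.isdisjoint(group)}
    for group in combinations(nodes, 4)
]
```

On this instance every coalition of 3 nodes misses exactly one slice, the missing slice differs across coalitions, and any 4 nodes jointly hold all 10 slices.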

For example, a (3,2)-allocation using the minimum number of nodes requires $n = k + r = 3 + 2 = 5$
nodes and the use of $s = \binom{k+r}{k} = \binom{5}{3} = 10$ slices. Figure 2.4 illustrates an example of
(3,2)-allocation distributing 10 slices over 5 nodes. The allocation is a (3,2)-allocation since it
replicates each slice twice while guaranteeing that no coalition of 3 nodes possesses all the slices.
More precisely, any coalition of 3 nodes misses exactly one slice and the missing slice is different
for each such coalition. For instance, coalition $\{n_1, n_2, n_3\}$ misses slice $s_{10}$, while coalition
$\{n_1, n_2, n_4\}$ misses slice $s_9$.


2.4.3 Discussion 


For simplicity, we have assumed that the data market can arbitrarily split resources as needed
for the definition of a (k,r)-allocation. However, thanks to its flexibility, our approach can be
adopted also when the encrypted resource is already organized in chunks that cannot be split for
allocation (e.g., blocks resulting from the AONT algorithm adopted), or in general when slicing is
constrained. Indeed, even if in the discussion, for simplicity, we consider slices of equal size, our
approach can be adopted also if the size varies. Also, slices can contain non-contiguous chunks
of the resource. Clearly, the number of chunks should be sufficient for the definition of a
(k,r)-allocation (e.g., k + 1 and $\binom{k+r}{k}$ in our two alternative configurations). If the resource includes fewer
chunks, it needs to be padded. If the resource includes more chunks than necessary, the data market
can combine the chunks into s slices and apply the chosen allocation function over these slices. As
an example, to define a (3,2)-allocation for a resource organized in 20 chunks using 5 nodes,
chunks can be arbitrarily combined to identify 10 slices for allocation. Alternatively, k-protection
and r-replication can be obtained by considering each chunk as a different slice and interpreting
the allocation function as periodic in s, or simply by randomly allocating the chunks after the first
s (which are the ones necessary to guarantee k-protection). For instance, a (3,2)-allocation for
a resource with 20 chunks using 5 nodes can be obtained by applying the allocation function in
Figure 2.4 twice (on slices $s_1, \ldots, s_{10}$ and $s_{11}, \ldots, s_{20}$), or by using it for slices $s_1, \ldots, s_{10}$ while
arbitrarily allocating slices $s_{11}, \ldots, s_{20}$ at two nodes each.
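
The periodic interpretation of the allocation function can be sketched as follows. The names are hypothetical, chunks are indexed from 0 for convenience, and the order in which the ten storage sets are listed is an arbitrary assumption that need not match Figure 2.4.

```python
from itertools import combinations

nodes = ["n1", "n2", "n3", "n4", "n5"]
# Storage sets of the ten slices of a Min_nodes (3,2)-allocation: all the
# 2-node subsets of the 5 nodes (each is the complement of a 3-node coalition).
slice_nodes = [set(pair) for pair in combinations(nodes, 2)]

def chunk_allocation(num_chunks, slice_nodes):
    """Allocate chunk c like slice (c mod s): the function is periodic in s."""
    s = len(slice_nodes)
    return {c: slice_nodes[c % s] for c in range(num_chunks)}

alloc = chunk_allocation(20, slice_nodes)
print(alloc[0] == alloc[10])      # True: chunk 10 reuses the allocation of chunk 0

# Chunks that each coalition of 3 nodes cannot obtain.
missing = {
    coalition: [c for c, stored in alloc.items() if stored.isdisjoint(coalition)]
    for coalition in combinations(nodes, 3)
}
```

With 20 chunks and period 10, each coalition of 3 nodes misses two chunks (one per period), so 3-protection and 2-replication are preserved.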


2.5 Availability and protection guarantees 


Parameters r and k introduced in the previous section characterize the degree of replication and 
of protection against malicious coalitions of nodes. Such parameters provide a clean and pre- 
cise modeling and allow reasoning about properly setting the number of slices and the number 
of nodes to be involved in the allocation. The setting of k and r to provide given security and 
availability guarantees clearly depends on the specific characteristics of the network. For instance, 
in a stable network a low number of replicas may suffice to provide high availability, while in a 
highly dynamic and non-resilient network a higher number of replicas should be used to enjoy 
the same guarantee. In the same vein, actual protection against possible exposure of a resource to 
malicious coalitions depends on the nature of nodes involved in the allocation. Consistently with 
these observations, we note that a natural way to express and reason about availability and protec- 
tion guarantees is the probability of the resource to become unavailable and the probability of a 
coalition of malicious nodes to jointly possess all the resource slices. In this section, we illustrate 
how to derive proper r and k settings, to be then used for splitting resources into slices and for
slice allocation, starting from the desired guarantees of availability and security expressed in terms
of such probabilities. Clearly, the probability of a resource becoming unavailable, or exposed
to malicious coalitions, depends on the probability of individual nodes becoming unavailable or
behaving maliciously. We then introduce the probability of a single node to fail, and hence to
become unavailable, denoted $p_u$, and the probability of a node to behave maliciously, and hence to
participate in a malicious coalition compromising protection, denoted $p_c$. We assume, as common
in decentralized systems, the probability $p_u$ of failure to be the same for all nodes and the failure
of any node not to be influenced by the failure of the other nodes. This assumption enables a clean
modeling, which can be taken as a reference for reasoning on different probability distributions.
Since the selection of storage nodes is driven by a pseudorandom function, we also consider a
uniform probability $p_c$ of compromise and assume independence of compromise events on different
nodes. We introduce the probability of a resource to become unavailable, denoted $P_u$, and of
being exposed to a malicious coalition, denoted $P_c$, when using a (k,r)-allocation. The analysis
will then guide the identification of the values for k and r to be used to guarantee that $P_u$ and $P_c$
do not exceed a given threshold. We discuss separately the Min_slices and Min_nodes allocation
strategies introduced in the previous section, which, as we will see, exhibit a different behavior
with respect to availability and security guarantees.

Figure 2.7: Probability that the resource is unavailable (a,c) and that it is exposed (b,d) using a
(k,r)-allocation that minimizes the number of slices, with r = 5 varying k between 1 and 25 (a,b),
and with k = 5 varying r between 1 and 25 (c,d)


2.5.1 Min_slices allocation


Using a (k,r)-allocation with the minimum number of slices, a resource becomes unavailable
when, for at least one of the k + 1 slices composing the resource, all the r nodes storing a replica of
the slice fail. The probability of such an event is $P_u = 1 - (1 - (p_u)^r)^{k+1}$, where $(1 - (p_u)^r)$
is the probability that at least one of the r replicas of a slice is available and, by the assumption of
independence of the failure events, $(1 - (p_u)^r)^{k+1}$ is the probability that at least one replica of each
of the k + 1 slices is available. In the same vein, the resource becomes exposed (and hence a
compromise happens and deletion cannot be guaranteed) when a coalition of malicious nodes
collectively possesses all the k + 1 slices, that is, when the coalition contains k + 1 nodes each
possessing a different slice. The probability of such an event is $P_c = (1 - (1 - p_c)^r)^{k+1}$,
where $(1 - p_c)^r$ is the probability that none of the r replicas of a slice is stored on a node that is
part of a coalition and, consequently, $1 - (1 - p_c)^r$ is the probability that at least one replica of the
slice is exposed. Since such an exposure must involve all the k + 1 slices, the probability that a
coalition possesses all the slices is $(1 - (1 - p_c)^r)^{k+1}$.
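
The two closed-form expressions can be evaluated directly. In this minimal sketch the function names are hypothetical, and the sample values (k = 5, r = 5, and a node probability of 0.2) are chosen only for illustration.

```python
def min_slices_unavailable(k, r, p_u):
    """P_u = 1 - (1 - p_u^r)^(k+1): some slice has all r replicas down."""
    return 1 - (1 - p_u ** r) ** (k + 1)

def min_slices_exposed(k, r, p_c):
    """P_c = (1 - (1 - p_c)^r)^(k+1): every slice has a compromised replica."""
    return (1 - (1 - p_c) ** r) ** (k + 1)

print(min_slices_unavailable(5, 5, 0.2))
print(min_slices_exposed(5, 5, 0.2))
```

Evaluating the functions on neighbouring values of k and r shows the opposite effect of the two parameters: a larger k raises the unavailability probability and lowers the exposure probability, while a larger r does the reverse.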

Figure 2.7 illustrates how k and r affect the values of $P_u$ (Figure 2.7(a,c)) and $P_c$ (Figure 2.7(b,d)),
considering different values of $p_u$ and $p_c$, respectively. The values considered for $p_u$ and $p_c$ are
0.2, 0.4, 0.6, and 0.8. These values, extremely pessimistic with respect to what can be expected in
real systems, have been chosen to study the behavior of the probabilistic formulas. Figure 2.7(a)
reports the values of $P_u$ assuming a fixed number r = 5 of replicas and varying k between 1 and 25.
Figure 2.7(c) reports the values of $P_u$ assuming a fixed k = 5 and varying the number r of replicas
between 1 and 25. Figures 2.7(b,d) report the values of $P_c$ in the same settings as Figures 2.7(a,c).
As can be seen from Figure 2.7(a), $P_u$ increases as the value of k increases, because the number
of nodes used in the allocation increases and therefore the probability of availability of a larger
number of slices decreases. Indeed, the number of nodes necessary to reconstruct a resource grows
with k (it is k + 1), and the probability of availability of all the nodes necessary to reconstruct the
resource decreases. However, $P_u$ remains low if the failure probability $p_u$ of a single node is low.
Probability $P_u$ instead decreases as the value of r increases (Figure 2.7(c)), because each slice will
be stored on a larger number of nodes, reducing the risk of unavailability. Figure 2.7(b) shows
that $P_c$ decreases as k increases because the number of nodes that should be part of a coalition
increases, meaning that the probability of forming a coalition decreases. Probability $P_c$ increases
as r increases (Figure 2.7(d)), because the number of replicas of each slice increases and therefore
also the probability that one replica is stored on a compromised node increases.

Figure 2.8: Probability that the resource is unavailable (a,c) and that it is exposed (b,d) using a
(k,r)-allocation that minimizes the number of nodes, with r = 5 varying k between 1 and 25 (a,b),
and with k = 5 varying r between 1 and 25 (c,d)


2.5.2 Min_nodes allocation 


Using a (k,r)-allocation with the minimum number of nodes, the unavailability of the resource
occurs when any combination of r (or more) nodes becomes unavailable. In fact, regardless of
the slices that those nodes store, such an event causes at least one slice to be unavailable. The
probability $P_u$ that a resource becomes unavailable is then

$$P_u = \sum_{i=r}^{k+r} \binom{k+r}{i} (p_u)^i (1 - p_u)^{k+r-i},$$

where the binomial coefficient $\binom{k+r}{i}$ is the number of all possible combinations of i nodes over k + r,
with i varying in the range r, ..., k + r, that can be unavailable; $(p_u)^i$ is the probability that i nodes
are unavailable; and $(1 - p_u)^{k+r-i}$ is the probability that the remaining nodes (i.e., $k + r - i$) are
available. In the same vein, any coalition of k + 1 nodes causes an exposure of the resource,
regardless of the slices they store. Relying on the minimum number of nodes, in fact, implies
that any coalition of k + 1 nodes possesses all the slices. The probability $P_c$ of a compromise is
then

$$P_c = \sum_{i=k+1}^{k+r} \binom{k+r}{i} (p_c)^i (1 - p_c)^{k+r-i},$$

where the binomial coefficient $\binom{k+r}{i}$ is the number of all
possible coalitions of i nodes over k + r nodes, with i varying in the range k + 1, ..., k + r; $(p_c)^i$ is
the probability that i nodes form a coalition; and $(1 - p_c)^{k+r-i}$ is the probability that the remaining
nodes (i.e., $k + r - i$) are not compromised.
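
Both sums are upper tails of a binomial distribution over the k + r nodes and can be computed with the standard library. The function names in this minimal sketch are hypothetical.

```python
from math import comb

def binomial_tail(n, p, lo):
    """P[X >= lo] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(lo, n + 1))

def min_nodes_unavailable(k, r, p_u):
    """P_u: at least r of the k + r nodes are unavailable."""
    return binomial_tail(k + r, p_u, r)

def min_nodes_exposed(k, r, p_c):
    """P_c: at least k + 1 of the k + r nodes are compromised."""
    return binomial_tail(k + r, p_c, k + 1)

# Smallest instance: k = 1, r = 1, hence two nodes storing one slice each.
print(min_nodes_unavailable(1, 1, 0.5))   # 0.75: at least one of 2 nodes fails
print(min_nodes_exposed(1, 1, 0.5))       # 0.25: both nodes are compromised
```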

Figure 2.8 illustrates how k and r affect the values of $P_u$ (Figures 2.8(a,c)) and $P_c$
(Figures 2.8(b,d)), considering different values of $p_u$ and $p_c$, respectively. The values considered for
$p_u$ and $p_c$ are 0.2, 0.4, 0.6, and 0.8. Figure 2.8(a) reports the values of $P_u$ assuming a fixed number
r = 5 of replicas and varying k between 1 and 25. Figure 2.8(c) reports the values of $P_u$ assuming
a fixed k = 5 and varying the number r of replicas between 1 and 25. Figures 2.8(b,d) report the
values of $P_c$ in the same settings as Figures 2.8(a,c). From the figures, it is immediate to see that $P_u$
and $P_c$ exhibit a similar behavior under the configurations minimizing the number of slices
and the number of nodes (i.e., $P_u$ increases as k grows and decreases as r grows, while $P_c$ decreases
as k grows and increases as r grows).


2.5.3 Setting k and r 


Our modeling of the probability that a resource is not available ($P_u$) and that it is exposed ($P_c$) can
be used to set appropriate values for parameters k and r. To this purpose, fixing the maximum
threshold $P_u^{\max}$ of resource unavailability and $P_c^{\max}$ of resource exposure, we compute all the
configurations of k and r that guarantee $P_u \leq P_u^{\max}$ and $P_c \leq P_c^{\max}$. Clearly, the values of k and r for
the configurations satisfying the thresholds depend on the chosen allocation function.

Comparing the evolution of the probability $P_u$ that a resource becomes unavailable using the
Min_slices and Min_nodes allocation strategies, varying k (Figure 2.7(a) and Figure 2.8(a)), we
can easily see that Min_slices is more robust against node failure (i.e., $P_u$ increases slowly) than
Min_nodes. This is due to the fact that, even if the number of nodes involved in the allocation
increases in both configurations, with an allocation that minimizes the number of nodes the impact
of a node failure on the availability of the resource is significant. A similar comment applies when
comparing how $P_u$ evolves in the two configurations varying the number r of replicas (Figure 2.7(c)
and Figure 2.8(c)). In this case, the decrease of $P_u$ with Min_slices is faster than the decrease of
$P_u$ with Min_nodes. Therefore, we can conclude that, for configurations with the same values for
r and k, Min_slices exhibits higher availability.

Comparing the evolution of the probability $P_c$ that a resource is exposed due to a coalition of
at least k + 1 nodes using the Min_slices and Min_nodes allocation strategies, varying k (Figure 2.7(b)
and Figure 2.8(b)), we can easily see that Min_nodes is more robust (i.e., $P_c$ decreases faster) than
Min_slices. This is due to the fact that, with an allocation that minimizes the number of nodes, the
probability of forming a coalition of at least k + 1 nodes among the k + r nodes is smaller than the
probability of controlling at least one of the r nodes for each of the k + 1 slices of the allocation
that minimizes the slices. A similar comment applies when comparing how $P_c$ evolves in the two
configurations varying the number of replicas (Figure 2.7(d) and Figure 2.8(d)). The increase of
probability $P_c$ using Min_slices is faster than the increase of $P_c$ using Min_nodes, because it is
more difficult to control at least k + 1 of the k + r nodes than to control at least one node in each of
the k + 1 distinct groups of r nodes. Therefore, we can conclude that, for configurations with the
same values for r and k, Min_nodes exhibits higher security.

Once the allocation function has been chosen, given the maximum threshold $P_u^{\max}$ of resource
unavailability and $P_c^{\max}$ of resource exposure, different configurations of k and r guarantee that
$P_u \leq P_u^{\max}$ and $P_c \leq P_c^{\max}$. Among all these configurations, the ones with low replication factor
(r) require less storage and have lower economic costs, while the ones involving a limited number
(n) of nodes enjoy simplicity in the management of the system and better performance of access
operations (fewer connections have to be established). Figure 2.9 considers three different network
configurations, characterized by a different probability $p_u$ for single nodes to fail and a different
probability $p_c$ to behave maliciously, and illustrates the configurations of k and r satisfying the
above thresholds using the Min_slices and Min_nodes allocation strategies. In the figure, the orange
area on the top-left represents the configurations of k and r that satisfy the availability requirement
(i.e., $P_u \leq 10^{-7}$), while the blue area on the bottom-right represents the configurations that satisfy
the security requirement (i.e., $P_c \leq 10^{-6}$). The intersection between the orange and blue areas
represents configurations that provide both availability and security guarantees. Among these
configurations, the one located at the bottom-left corner of the intersecting area is the one to be
preferred, as the number of nodes and replicas is minimum.

Figure 2.9: Min_slices and Min_nodes (k,r)-allocations that guarantee $P_u \leq 10^{-7}$ and $P_c \leq 10^{-6}$
with different values for $p_u$ and $p_c$
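
The selection of a configuration can be sketched as an exhaustive search over a grid of k and r values. Everything in this sketch is an assumption made for illustration: the function names, the grid bounds, the tie-breaking rule (fewest nodes first, then fewest replicas), and the choice of node probabilities 0.005 (failure) and 0.2 (compromise) with thresholds $10^{-7}$ and $10^{-6}$.

```python
from math import comb

def tail(n, p, lo):
    """P[X >= lo] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(lo, n + 1))

# For each strategy: (unavailability, exposure, number of nodes) as functions.
STRATEGIES = {
    "Min_slices": (
        lambda k, r, p: 1 - (1 - p ** r) ** (k + 1),
        lambda k, r, p: (1 - (1 - p) ** r) ** (k + 1),
        lambda k, r: r * (k + 1),
    ),
    "Min_nodes": (
        lambda k, r, p: tail(k + r, p, r),
        lambda k, r, p: tail(k + r, p, k + 1),
        lambda k, r: k + r,
    ),
}

def best_configuration(strategy, p_u, p_c, pu_max, pc_max, k_max=40, r_max=10):
    """Return (k, r, n) with the fewest nodes among feasible grid points."""
    unavailable, exposed, n_nodes = STRATEGIES[strategy]
    feasible = [
        (n_nodes(k, r), r, k)
        for k in range(1, k_max + 1)
        for r in range(1, r_max + 1)
        if unavailable(k, r, p_u) <= pu_max and exposed(k, r, p_c) <= pc_max
    ]
    if not feasible:
        return None
    n, r, k = min(feasible)          # fewest nodes, then fewest replicas
    return k, r, n

for name in STRATEGIES:
    print(name, best_configuration(name, 0.005, 0.2, 1e-7, 1e-6))
```

Under these assumptions the search returns k = 26, r = 4 (108 nodes) for Min_slices and k = 12, r = 5 (17 nodes) for Min_nodes.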

Figures 2.9(a,b) consider nodes with $p_u = 0.005$ and $p_c = 0.2$. The optimal configuration
for Min_slices is k = 26 and r = 4 (i.e., n = 108), while for Min_nodes it is k = 12
and r = 5 (i.e., n = 17). The second allocation, although more expensive on storage, due to
one additional replica, considerably reduces the number of nodes involved in the storage of the
resource compared to the adoption of the first allocation function. Our analysis demonstrates that
this is a general behavior: Min_nodes requires the same (or a slightly higher) number r of replicas
and a significantly lower number n of nodes than Min_slices. This observation is confirmed by
the extreme scenarios illustrated in Figures 2.9(c,d), considering highly reliable ($p_u = 0.001$) but
scarcely trusted ($p_c = 0.5$) nodes, and in Figures 2.9(e,f), considering unreliable ($p_u = 0.05$) but
relatively trusted ($p_c = 0.1$) nodes. The optimal configurations in Figures 2.9(c,d) are k = 100 and
r = 3 for Min_slices (i.e., n = 303), and k = 27 and r = 4 for Min_nodes (i.e., n = 31, meaning
that the number of nodes is about ten times smaller). The optimal configurations in Figures 2.9(e,f) are
k = 10 and r = 9 (i.e., n = 99) for Min_slices, and k = 18 and r = 7 (i.e., n = 25) for Min_nodes.
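
The node counts quoted above follow from n = r · (k + 1) for Min_slices and n = k + r for Min_nodes. The short sketch below (function and variable names are our own) recomputes them from the reported values of k and r.

```python
def nodes_min_slices(k, r):
    """Min_slices uses r nodes for each of the k + 1 slices."""
    return r * (k + 1)

def nodes_min_nodes(k, r):
    """Min_nodes uses exactly k + r nodes."""
    return k + r

# (k, r) reported for Min_slices and for Min_nodes in each setting.
reported = {
    "Figures 2.9(a,b)": ((26, 4), (12, 5)),
    "Figures 2.9(c,d)": ((100, 3), (27, 4)),
    "Figures 2.9(e,f)": ((10, 9), (18, 7)),
}
counts = {
    setting: (nodes_min_slices(*slices_cfg), nodes_min_nodes(*nodes_cfg))
    for setting, (slices_cfg, nodes_cfg) in reported.items()
}
for setting, (n_slices, n_nodes) in counts.items():
    print(setting, n_slices, n_nodes)
```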

Our analysis confirms that, for a wide range of values for $P_u$ and $P_c$ and assumptions on the
node availability $p_u$ and compromise risk $p_c$, our approach is able to identify a configuration of
r and k with manageable complexity (i.e., a reasonable number of replicas and of nodes). We
note that, even when r and k grow, the minimum number of slices composing a resource remains
limited.


2.6 Encoding strategy for improving availability 


Typically, DCS networks provide availability by employing slice replication. Instead of simple 
replication in the allocation, several DCS networks leverage the dynamic application of erasure 
codes (e.g., Reed-Solomon). With Reed-Solomon, if a node participating in the service becomes 
unavailable, its slices are dynamically re-allocated to another node. The advantage of Reed- 
Solomon with respect to simple replication is that it provides reliability guarantees at a fraction 
of the storage overhead that would come with replication. However, it has two main drawbacks. 
First, it is a fixed-rate encoding technique and therefore computing the slice to be re-allocated 
requires reconstructing the complete resource. Second, if the old node originally storing a (then 
re-allocated) slice comes back online, more replicas than actually needed would be available for 
the same slice and the economic cost brought by the involvement of more nodes does not bring 
a clear advantage for availability. To overcome these drawbacks, we propose to adopt fountain 
codes, in contrast to traditional erasure codes like Reed-Solomon, for computing slices to be
distributed to nodes in a DCS. Our approach avoids the potentially excessive generation of slices,
which may turn out to be unnecessary when, as is often the case, node unavailability is only
temporary.

Figure 2.10: An example of slice generation/replication using Reed-Solomon (a) and fountain
codes (b) in a dynamic scenario

Given a resource, encrypted using AONT to protect confidentiality, the ciphertext is encoded
using fountain codes to provide availability. The ciphertext is organized in f original fragments
and encoded into a set S = {s_1,...,s_s} of s slices allocated to s randomly chosen nodes N =
{n_1,...,n_s} (i.e., each slice is allocated to a different node). To retrieve the resource content, it
is sufficient to contact an arbitrary subset of f nodes and download their slices, which are then
combined to reconstruct the resource.


Since fountain codes are rateless, if a node n_i leaves the DCS, the data market does not need
(as it would using Reed-Solomon) to reconstruct its slice s_i and re-allocate it to a different
node to guarantee the same resource availability. Indeed, it is sufficient to generate a new slice s_{s+1}
(different from s_1,...,s_s) and allocate it to a new node n_{s+1}. After the generation and allocation
of s_{s+1}, any subject who is authorized for the resource can still reconstruct the resource content
by downloading any subset of f slices from the resulting set (S \ {s_i}) ∪ {s_{s+1}} of s slices. The
generation of a new slice, in contrast to the replication of the unavailable one, provides higher
resource availability. Indeed, every slice is unique and equally contributes to the reconstruction of
the resource. Hence, when the previously unavailable node n_i comes back online, the increased
economic cost due to the involvement of a larger number (s + 1) of nodes comes with an actual
advantage in terms of higher availability. In fact, there will be s + 1 (instead of s) different slices
stored at s + 1 different nodes that can all be used for resource reconstruction. Note that the
generation of a new slice causes a limited overhead since it implies the download of a subset of
the slices in S, without the need to reconstruct the resource (as required by other erasure codes).
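The rateless behavior described above can be illustrated with a minimal sketch. It assumes a random linear fountain code over GF(2), chosen here only for illustration, since the text does not state which fountain code is adopted. Each slice is the XOR of a pseudo-random subset of the f fragments, so a fresh slice can be produced at any time under a new identifier. With this simple code, f slices decode only when their combinations are linearly independent, so a few extra slices may occasionally be needed. The function names and the use of the slice identifier as a seed are hypothetical.

```python
import random

def make_slice(fragments, slice_id):
    """Encode one slice: the XOR of a pseudo-random non-empty subset of the fragments."""
    f = len(fragments)
    mask = random.Random(slice_id).randrange(1, 1 << f)  # bit j set -> fragment j included
    payload = 0
    for j in range(f):
        if (mask >> j) & 1:
            payload ^= fragments[j]
    return mask, payload

def decode(slices, f):
    """Recover the f fragments by Gaussian elimination over GF(2), or return None."""
    basis = {}  # highest set bit of a reduced mask -> (mask, payload)
    for mask, payload in slices:
        while mask:
            top = mask.bit_length() - 1
            if top not in basis:
                basis[top] = (mask, payload)
                break
            mask ^= basis[top][0]      # eliminate the leading fragment
            payload ^= basis[top][1]
    if len(basis) < f:
        return None  # the available slices do not span all f fragments
    fragments = []
    for j in range(f):  # back-substitution, lowest fragment index first
        mask, payload = basis[j]
        for i in range(j):
            if (mask >> i) & 1:
                payload ^= fragments[i]
        fragments.append(payload)
    return fragments
```

In this sketch fragments are plain integers. When a node leaves, the market would call `make_slice` with a new identifier rather than re-creating the lost slice, and any sufficiently independent subset of the resulting slices can be passed to `decode`.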


As discussed, the rateless characteristic of fountain codes allows the adaptive adjustment of
the number of slices available for reconstructing a resource, thus impacting the availability and
security guarantees. We use this characteristic (see Section 2.7) in such a way that, when the
availability guarantees go below a given threshold, the data market can generate a new slice.
Analogously, when the risk of confidentiality exposure is above a given threshold due to the presence
of a high number of slices, the data market can ask the data owner to re-encrypt the resource and
generate a new set of s slices.


Figure 2.10 illustrates an example that compares the set of slices in a DCS when using
Reed-Solomon and fountain codes, assuming the same number of nodes that leave/join the DCS. Here,
we assume that the data market partitions the resource in f = 7 fragments and encodes it in s = 10
slices and that nodes where the slices are stored leave and then possibly re-join the network. Since
Reed-Solomon reacts to a node failure by replicating the slice of the failed node, the set S of slices
representing the resource never changes but the number of copies grows (e.g., s_7 has three copies).
Fountain codes do not cause any replication, but a new slice is generated at each failure. Note that
the two techniques imply the same economic cost for the data market, since each failure causes
the allocation of a (new or duplicated) slice to a node. In the considered example, the market pays
for 15 slices in both scenarios.


2.7 Availability and security guarantees with encoding 


Node failure has an impact on both resource availability and confidentiality. Indeed, when one of
the nodes n_i fails, the encoded slice s_i it stores cannot be used to reconstruct the resource content.
Hence, to reconstruct the resource, it is necessary to retrieve f out of s − 1 (instead of s)
slices. When n_i re-joins the DCS, it still stores s_i and this could have a positive effect on resource
availability, since the slice at the node could be used to reconstruct the resource. However, node n_i
re-joining the DCS could have a negative impact on security, since n_i could exploit its knowledge
of s_i and collude with other f − 1 nodes to reconstruct the resource (or prevent its deletion).
Similarly, the generation and allocation of a new slice s_{s+1} improves resource availability, but
also naturally reduces security since there is a higher number of nodes that could possibly collude.

In this section, we analyze the advantages provided by fountain codes on availability guarantees,
by studying the probability P_u that a resource becomes unavailable as a consequence of nodes
leaving and re-joining the network. We also evaluate the risks that the adoption of our solution
causes in terms of security, by studying the probability P_e that a coalition of malicious nodes has
enough slices to compromise resource security. These probabilities depend on the probability p_u
that a node fails, and on the probability p_e that a node is malicious and interested in colluding with
other nodes to breach the confidentiality of the resource. For simplicity, we assume p_u and p_e to
be the same for all the nodes.


2.7.1 Availability guarantees 


When using s slices to encode a resource split into f original fragments, the resource becomes
unavailable when more than s − f nodes fail (i.e., when fewer than f slices can be accessed). The
probability of such an event is $P_u = \sum_{i=s-f+1}^{s} \binom{s}{i} p_u^{i} (1-p_u)^{s-i}$. Probability P_u increases
when one of the nodes fails (or leaves the DCS) since the slice it stores is no longer available, while
it decreases any time a new slice is generated or a failed node re-joins the network.
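As a quick numerical check of the binomial tail above, the following sketch evaluates the unavailability probability for given s, f, and node failure probability. The function name is hypothetical, and the parameter values in the usage note are only illustrative.

```python
from math import comb

def unavailability(s, f, p_u):
    """Probability that more than s - f of the s nodes fail (fewer than f slices remain)."""
    return sum(
        comb(s, i) * p_u**i * (1 - p_u) ** (s - i)
        for i in range(s - f + 1, s + 1)
    )
```

For instance, `unavailability(10, 7, 0.015)` evaluates to just below 10^{-5}, and the value grows when a slice is lost (`s = 9`) and shrinks when one is added (`s = 11`).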

Even if, in principle, the data market should react every time a node n_i fails by generating a
new slice, this practice is expensive and may not even be necessary (e.g., if n_i re-joins the network
in a few hours). The increase in P_u caused by the failure of a node may not be critical in a scenario
where nodes dynamically leave and re-join the system frequently. Indeed, the increase in P_u
may be temporary and may not considerably affect the possibility to reconstruct the resource. To
properly take these aspects into consideration, we propose a solution where the data market takes
corrective actions (i.e., creates a new slice) only when resource availability is considered at risk.

Our solution is based on the definition of two thresholds for P_u, P_u^min and P_u^max, identifying the
range of values considered acceptable by the data market for guaranteeing the availability of the
resource. Intuitively, these thresholds influence the maximum and minimum number of available
slices in the system and represent:


• P_u^max: the maximum probability of failure that the data market can tolerate, which
corresponds to the minimum number of slices (> f) the market considers desirable to keep
resource unavailability under control;


• P_u^min: the minimum probability of failure fixed by the data market based on its economic
availability, which corresponds to the maximum number of slices the market can afford.


Figure 2.11: Probability that a resource becomes unavailable using Reed-Solomon and fountain
codes, assuming p_u = 0.015, P_u^min = 10^{-12}, P_u^max = 10^{-5}, f = 7 fragments, and s in [10,15]

The data market does not react every time a node leaves or re-joins the DCS, but only if this
event causes P_u to become higher than P_u^max or smaller than P_u^min. If P_u exceeds P_u^max, the data
market generates a new slice and allocates it to a new node. If P_u goes below P_u^min, the data market
can terminate the contract with one of the nodes in the system (e.g., with the one that has been
off-line the most) and stop paying for its services.
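The reaction policy can be sketched as a small decision function that compares the current unavailability probability with the two thresholds. This is only an illustration under stated assumptions: the function and action names are hypothetical, and the thresholds are passed in as plain numbers.

```python
from math import comb

def unavailability(available, f, p_u):
    """Probability that fewer than f of the currently available slices stay reachable."""
    return sum(
        comb(available, i) * p_u**i * (1 - p_u) ** (available - i)
        for i in range(available - f + 1, available + 1)
    )

def react(available, f, p_u, p_min, p_max):
    """Return the corrective action for the current number of available slices."""
    p = unavailability(available, f, p_u)
    if p > p_max:
        return "generate_slice"   # availability at risk: add a new slice on a new node
    if p < p_min:
        return "close_contract"   # more slices than the market wants to pay for
    return "no_action"            # within the acceptable range
```

With the illustrative values f = 7, p_u = 0.015, p_min = 10^{-12}, and p_max = 10^{-5}, the function returns `no_action` for 10 available slices, `generate_slice` for 9, and `close_contract` for 15.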

Consider, as an example, a resource partitioned in f = 7 original fragments and encoded in s = 10
slices allocated at nodes with probability p_u = 0.015 of failure. Figure 2.11 compares the
probability P_u of unavailability of a resource when using Reed-Solomon and fountain codes, assuming
P_u^min = 10^{-12} and P_u^max = 10^{-5}, and varying the number of nodes that the data market identifies as
unavailable between 0 and 9. The error bars in the figure represent the standard deviation. Note
that P_u is initially the same for Reed-Solomon and fountain codes, and evolves in the same manner
when nodes leave the DCS. The evolution of P_u when a node re-joins the DCS is instead considerably
different: with Reed-Solomon it causes duplication of a slice, with fountain codes it implies
the availability of an additional (different) slice. As visible from the figure, the probability that
the resource becomes unavailable is higher using Reed-Solomon than using fountain codes. In
fact, the nodes storing slices available in a single copy (e.g., s_4 in Figure 2.10(a)) play a critical
role in resource reconstruction. Indeed, when failing, they cannot be immediately substituted.
With fountain codes, all nodes are equally critical since they all store different slices that can be
interchangeably used to reconstruct the resource.

The top chart in Figure 2.12 illustrates an example of evolution of P_u, assuming the adoption
of fountain codes, considering the corrective actions taken by the data market. The value of P_u
grows every time a node leaves and decreases every time a node re-joins the network. As long as
P_u is between P_u^min and P_u^max (the two red dashed lines in the figure), the data market does not react
to node leave and re-join events. In the example, we set P_u^min and P_u^max in such a way as to tolerate
the leave and re-join of one node at a time. When P_u reaches P_u^max as a consequence of the failure
of a node (red triangles 1, 2, and 4 in the figure, indicating unavailability of two nodes), the data
market generates and allocates a new slice. The generation of a new slice reduces P_u to a value
below P_u^max (green circles 1, 2, and 4 in the figure). When P_u reaches P_u^min as a consequence of node
re-join (red triangle 3 in the figure, indicating re-join of all the three nodes that failed), the data
market closes the contract with one of the nodes and P_u returns above the threshold (green circle 3
in the figure).

Figure 2.12: An example of evolution of P_u (top chart) and P_e (bottom chart) as a consequence of
nodes leaving and re-joining the DCS and of actions taken by the data market


In the next section, we will illustrate that P_u decreases not only when the data market creates a
new slice, but also when the resource is re-encrypted.


2.7.2 Security guarantees against malicious coalitions 


Since f slices are sufficient to reconstruct the resource, any coalition of at least f malicious nodes
can reconstruct the encrypted content of the resource (and its plaintext representation if the key is
exposed) and/or prevent its deletion by retaining their local copy of the slice. The probability that
f (or more) nodes collude is $P_e = \sum_{i=f}^{s} \binom{s}{i} p_e^{i} (1-p_e)^{s-i}$. The probability P_e that at least f
nodes collude increases whenever the data market generates a new slice and allocates it to a new node.
On the contrary, P_e never decreases. Indeed, we cannot assume that nodes leaving the system will not
re-join in the future and that they do not have a copy of the slice initially allocated to them, even if
the data market has closed the contract. A malicious node can keep a copy of the data on purpose,
to prevent resource deletion and possibly sell the slice to non-authorized subjects.

The data market can reduce P_e only by asking the resource owner to re-encrypt the resource
with a new key. This process is however quite expensive, as it requires locally reconstructing the
resource (downloading f slices), decrypting it, re-encrypting its plaintext content with a new encryption
key, encoding the resulting ciphertext, and distributing the new set of s slices to s nodes. To limit
such overhead, our solution is based on the definition of a threshold P_e^max for P_e, representing the
maximum probability that the data market (and resource owner) can tolerate that the resource is
exposed. When, as a consequence of the generation of a slice, P_e exceeds P_e^max, the data market
can require the resource owner to re-encrypt the resource. Clearly, slices generated before resource
re-encryption cannot be used together with slices generated after re-encryption for resource
reconstruction, hence nullifying possible misbehaviors by nodes whose contract has been closed before
re-encryption.
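The interplay between slice generation and re-encryption can be sketched as follows. The sketch counts every node that has ever received a slice of the current encryption (since the exposure probability never decreases) and resets the count to s when the threshold is exceeded. Function names and the numeric values in the usage note are hypothetical and serve only as an illustration.

```python
from math import comb

def exposure(distributed, f, p_e):
    """Probability that at least f of the nodes that ever received a slice collude."""
    return sum(
        comb(distributed, i) * p_e**i * (1 - p_e) ** (distributed - i)
        for i in range(f, distributed + 1)
    )

def after_new_slice(distributed, f, p_e, p_e_max, s):
    """Account for one more slice; re-encrypt when the exposure threshold is exceeded.

    Returns (slices distributed under the current encryption, re-encryption flag).
    """
    distributed += 1
    if exposure(distributed, f, p_e) > p_e_max:
        return s, True    # new key, new set of s slices; old slices become useless
    return distributed, False
```

With f = 7, p_e = 0.1, s = 10, and a threshold of 4 · 10^{-5} (all illustrative), the first additional slice is tolerated, while the second one triggers re-encryption.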

The bottom chart in Figure 2.12 illustrates the evolution of P_e due to corrective actions taken
by the data market for keeping P_u below P_u^max (i.e., the generation of new slices). As long as P_e
remains below threshold P_e^max, no corrective action is taken (green circles 1 and 2 in the figure).
In the example we set P_e^max to tolerate the generation of two slices. When the generation of a new
slice causes P_e to reach P_e^max (red triangle 5 and green circle 4 in the figure), the market requires
the resource owner to re-encrypt the resource. Resource re-encryption resets both P_u and P_e to
their initial value (green circle 5 in the figure). The lower P_u^max is, the more frequently slices will
be generated and the more frequently the resource owner will need to re-encrypt the resource.


2.8 Summary 


This chapter focused on the problem of resource storage in the data market and considered the 
opportunity of relying on DCS services. The approach illustrated in this chapter enables the data 
market to protect resources and to control their decentralized allocation to different nodes in the 
network. This chapter investigated different strategies for splitting and distributing resources, 
analyzing their characteristics in terms of availability and security guarantees. Also, it provided a 
modeling of the problem enabling owners to control the granularity of slicing and diversification 
of allocation to ensure the aimed availability and security guarantees. 

MOSAICrOWN is now studying an AONT encryption mode that enhances the Mix&Slice technique
to improve its performance, by reducing the number of mixing phases necessary to encrypt a
macro-block.


3. Data wrapping for controlled sharing


In this chapter, we address the problem of enabling controlled data sharing in the data market. In 
particular, we consider the key requirement of enabling data owners to remain in control of their 
data, granting selective access to requesting processors (i.e., entities that are interested in accessing 
portions of the data made available by owners) without the need of delegating the management of 
access requests to the data market itself. Our solution is based on the controlled adoption of owner- 
side encryption, by means of which data are wrapped, before being outsourced to the market, with 
an encryption layer. To enable selective access, different data items are encrypted with different 
encryption keys, which are made known to authorized subjects. To reduce the overhead of key 
management, we leverage key derivation techniques. We also consider the possibility to provide 
incentives to data owners for making their data available in the market, and identify and limit 
possible misbehaviors occurring with (malicious) entities when dealing with such incentives. 

The remainder of this chapter is organized as follows. Section 3.1 discusses the state of the art
and highlights MOSAICrOWN innovations. Section 3.2 introduces the reference scenario along
with its specific requirements. Section 3.3 discusses some needed preliminaries and the building
blocks over which our solution builds. Section 3.4 illustrates our proposal for protecting resources
and granting selective access. Section 3.5 presents our approach for managing incentives to data
owners and counteracting possible misbehaviors. Section 3.6 discusses some relevant features of
our solution, and Section 3.7 concludes the chapter.


3.1 State of the art and MOSAICrOWN innovation 


In this section, we illustrate the state of the art and the innovations produced by MOSAICrOWN 
for controlled sharing of data stored in the market, ensuring that owners remain in control of their 
data. 


3.1.1 State of the art 


The enforcement of access restrictions to resources outsourced to external storage and management
platforms has been extensively studied in the literature. Recent approaches have explored
the adoption of selective owner-side encryption (e.g., [ABFF09, DFJ+10, BDF+16]), wrapping
resources with a layer of encryption before outsourcing and distributing keys to authorized subjects,
in such a way that all and only authorized subjects can decrypt resources. Such an approach removes
the need for a trusted party in charge of managing access requests and granting or denying
access based on the authorization policy. None of the approaches in this line of work considers the
peculiarities of the data market scenario, including the possibility of rewarding owners.

Recently, the adoption of distributed ledgers (such as blockchain) and of smart contracts has
been investigated for exchanging resources and/or enforcing access control (e.g., [DEF18, DMR17,
DPNH20, ZNP15]). The proposal in [KAS+18] puts forward a solution for
auditable sharing of private data on blockchains, assuming a (collective) authority for enforcing
access restrictions. The proposal in aims at protecting resources through the use of en- 
cryption, without considering incentives to data owners and possible misbehaviors that can arise 
in this context. The problem of trading data has been investigated in [DPNH20], which how- 
ever adopts Bitcoin transactions and does not support fine-grained authorizations. The approach 
in focuses on ensuring fairness of resource exchange, without explicit reference to data 
markets and their need for fine-grained authorizations. A blockchain-based access control model 
has been proposed in [DMR17], which enables a processor to transfer her access rights to another
processor, without ensuring incentives to the owners. The proposal in protects data using 
encryption and on-chain storage of pointers to data. 


3.1.2 MOSAICrOWN innovation 


MOSAICrOWN produced several advancements over the state of the art, which are discussed in 
this section. 


• The first innovation is represented by a technique enabling the enforcement of authorization
policies regulating access to resources in the data market scenario, based on specific
adaptations of selective encryption and key derivation strategies, considering the possibility of
granting incentives to owners for making their resources available on the market.


• The second innovation relates to the identification and characterization of possible
misbehaviors that can arise when dealing with incentives to data owners and with interacting
subjects that do not fully trust each other.


• The third innovation is represented by the definition of a solution for counteracting the
misbehaviors that might arise in the considered scenario and for incentivizing all parties to
behave correctly, based on the combined adoption of blockchain and smart contracts, and
on a simple auditing protocol for identifying the subject(s) who misbehaved.


The results obtained by MOSAICrOWN and illustrated in this chapter have been published
in [DFLS19].


3.2 Scenario and requirements 


We address the problem of allowing subjects to leverage the availability of data market platforms
for selectively sharing their data with interested processors (i.e., entities that need to perform some
processing on them). Our scenario is characterized by a set O = {o_1,...,o_l} of data owners on one
side, and a set P = {p_1,...,p_m} of processors on the other side, who interact through the market
platform to trade data, modeled as a generic set R = {r_1,...,r_n} of resources. Such a scenario
entails a number of challenges and requirements that need to be carefully addressed. In particular,
we investigate two main issues characterizing the data market scenario: i) ensuring adequate
data protection, and ii) ensuring that owners and processors can profitably leverage data market
platforms. Our first goal aims at maintaining the owner in control of her data, selectively granting
access to resources according to her needs and wishes. As for the second goal, we consider and
manage incentives that encourage owners to contribute their data. While several incentives
can nicely fit the data market scenario, in this work we consider incentives based on monetization,
according to which a data owner granting access to one of her data items obtains a reward in terms
of a payment.

Request no. | Processor | Requested Resources
1 | w | a,b,c
2 | x | c,d,e,f
3 | y | a,b,c
4 | z | b,c
5 | z | a,d,e,f
6 | w | d,e,f

Figure 3.1: An example of a sequence of six access requests posed by four processors for subsets
of six resources

Hence, due to the involvement of remuneration incentives and since data owners
and processors might not fully trust each other, our second goal also requires defining mechanisms
for counteracting possible misbehaviors (e.g., a malicious owner does not provide access
to a resource for which she received an incentive or, conversely, a processor claims her money
back by maliciously declaring the owner did not grant access despite the provided incentive, see
Section 3.5.2 for more details). In designing our solution, we keep the following requirements
in mind:


R1. the content of published resources must remain protected, and only authorized processors 
can access their content; 


R2. the data owner must be aware of which processors have access to which resources; 
R3. the owner cannot claim that an incentive has not been received while it actually has; 


R4. after the owner has granted to a processor access to a resource, the processor cannot claim 
that access has not been granted (and ask to be refunded the paid incentive). 


While the first two requirements deal with ensuring data protection, the latter two reduce the 
possibility of misbehaviors from both the data owner and the processors. 

In the remainder of this chapter, we refer our examples to a set R = {a,b,c,d,e,f} of six
resources uploaded on the market platform. Such resources are of interest for four processors
P = {w,x,y,z}, which over time buy access to resources according to the sequence in Figure 3.1.


3.3 Preliminaries 


Our solution combines four building blocks: i) selective owner-side encryption; ii) key derivation; 
iii) blockchain; and iv) smart contracts. 


Selective owner-side encryption. Selective owner-side encryption consists in encrypting, at the
owner side, different resources with different keys, and in distributing keys to processors in such a
way that each processor can decrypt all and only the resources she is authorized to access
[DFLS16]. Since the encryption layer is provided at the owner side, resources self-enforce the
access restrictions defined over them and their content is protected also from the data
market. The data owner can then make her resources available to the market platform (which
can be hosted, for instance, on the cloud), with the guarantee that only processors knowing the
encryption keys will be able to decrypt resources. A straightforward solution to enforce access
restrictions through selective encryption consists in encrypting each resource with a different key,
and in distributing to each processor the keys of the resources in her capability list. However, this
practice would imply a considerable key management burden for processors, who would need to
manage one key for each resource for which they are authorized. To mitigate such overhead, we
adopt key derivation, as follows.


From | To | Value
a | c | k_c ⊕ h(k_a, l_c)
b | c | k_c ⊕ h(k_b, l_c)


Figure 3.2: An example of key derivation structure and token catalog


Key derivation. Key derivation makes it possible to derive the value of an encryption key k_y from the
knowledge of another encryption key k_x and of a public label l_y (i.e., a piece of information)
associated with k_y [ABFF09, DFJ+10]. The derivation of k_y from k_x is enabled by a public token
t_{x,y} computed as k_y ⊕ h(k_x, l_y), with ⊕ the bitwise xor operator, and h a deterministic non-invertible
cryptographic function. The derivation relationship between keys can be direct, via a single token,
or indirect, through a chain of tokens. Key derivation structures can be graphically represented
as directed acyclic graphs, where vertices represent encryption keys (and their labels), and edges
represent tokens. Tokens are physically stored in a public catalog T. Figure 3.2 illustrates an
example of derivation among three keys k_a, k_b, and k_c, and the corresponding token catalog T.
For simplicity, in our examples we use x to denote the label of key k_x, and use the label of a key
to denote the corresponding vertex (e.g., vertex a in Figure 3.2 represents key k_a and its label). In
the following, when clear from the context, we will use the terms keys and vertices (tokens and
edges, respectively) interchangeably.
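Token-based derivation can be illustrated with a minimal sketch. The text only requires h to be a deterministic non-invertible cryptographic function; the use of HMAC-SHA256 here, the 32-byte key length, and all function names are assumptions made for illustration.

```python
import hashlib
import hmac

def h(key: bytes, label: bytes) -> bytes:
    """Deterministic one-way function h(k, l); HMAC-SHA256 is an assumption of this sketch."""
    return hmac.new(key, label, hashlib.sha256).digest()

def xor(a: bytes, b: bytes) -> bytes:
    """Bitwise xor of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def make_token(k_x: bytes, k_y: bytes, l_y: bytes) -> bytes:
    """Public token t_{x,y} = k_y xor h(k_x, l_y); computing it requires both keys."""
    return xor(k_y, h(k_x, l_y))

def derive(k_x: bytes, l_y: bytes, token: bytes) -> bytes:
    """Recover k_y from k_x, the public label l_y, and the public token."""
    return xor(token, h(k_x, l_y))
```

Mirroring the catalog of Figure 3.2, a holder of k_a (or of k_b) can recover k_c from the public token and label, whereas a party holding neither key learns nothing useful from the catalog. Indirect derivation amounts to applying `derive` along a chain of tokens.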


Blockchain. A blockchain is a shared and trusted public ledger of transactions, maintained in
a distributed way by a decentralized network of peers. Transactions are organized in a list of
blocks, linked in chronological order, where each block contains a certain number of transaction
records and a cryptographic hash of the previous one. Each transaction is validated by the network
of peers, and is included in a block through a consensus protocol. The state of a blockchain is
continuously agreed upon by the network of peers: everyone can inspect a blockchain, but no
single user can tamper with it, since modifications to the content of a blockchain require mutual
agreement. Once a block is committed, nobody can modify it: updates are reflected in a new block
containing the new information. This makes it possible to trust the content and the status of a
blockchain, while not trusting the single peers.
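The linking of blocks through hashes can be sketched in a few lines. This is a deliberately simplified illustration: it omits consensus, signatures, and timestamps, and the block layout and function names are hypothetical.

```python
import hashlib
import json

def block_hash(block):
    """SHA-256 over a canonical JSON encoding of the block."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def append_block(chain, transactions):
    """Append a block that stores the hash of its predecessor."""
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev_hash": prev, "transactions": transactions})

def is_consistent(chain):
    """Check that every block still matches the hash recorded by its successor."""
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )
```

Changing the content of any committed block alters its hash, so the check fails for all later blocks unless they are rewritten as well, which in a real blockchain would require the agreement of the network.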


Smart contracts. Smart contracts are a powerful tool for establishing contracts among multiple,
possibly distrusting, parties. A smart contract is a piece of software running on top of a blockchain
that defines a set of rules on which the interacting parties agree. It can be seen as a set of ‘if-then’
instructions, defining triggering conditions and subsequent actions capturing and formalizing the clauses
of a contract to be signed by the parties. The execution of a smart contract can be trusted for
correctness thanks to the underlying blockchain consensus protocols, meaning that all the conditions
of the agreement modeled by the contract are certainly met and validated by the network. However,
smart contracts and their execution lack confidentiality and privacy, as plain visibility over
the content of a contract and over the data it manipulates is necessary for validation [CZK+19].

With reference to the requirements illustrated in Section 3.2, we leverage selective encryption
and key derivation to satisfy R1, protecting resources and granting access according to
an authorization policy defined and controlled by the data owner (Section 3.4). In this way, the
authorization policy is automatically enforced by wrapping resources in a layer of encryption
before they are published and distributing encryption keys to authorized users only. We then leverage
blockchain and smart contracts to satisfy R2, managing the interactions between
processors and the data owner, as well as access requests, by keeping an unmodifiable and
undeniable record of the granted access rights (Section 3.5). Finally, we satisfy R3 and R4 by
counteracting misbehaviors through an audit process that incentivizes all parties to behave correctly.


3.4 Protecting resources 


In this section, we illustrate how to enforce requirement R1 to guarantee that the content of each
resource is visible only to authorized processors. For simplicity, but without loss of generality,
we consider the set of resources published by one owner o, noting that the same reasoning
applies to each data owner operating on the data market. In line with our assumption of
remunerative incentives, we also assume that the data owner does
not pose access restrictions on her data other than requiring a payment. Our proposal
can however be easily extended to consider additional access conditions.


3.4.1 Authorization policy and key derivation structure 


We represent the authorization policy A by means of the capability lists of the processors in
P, where cap(p) represents the set of resources for which processor p ∈ P is authorized (and,
possibly, for which the owner o has received an incentive from p). Every time processor p is
granted access to a resource r ∈ R (and, as illustrated in the remainder of this chapter, has paid the
incentive to the owner), r will be added to her capability list (i.e., cap(p) = cap(p) ∪ {r}).
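The incremental update of capability lists can be illustrated by replaying the sequence of requests of Figure 3.1. The sketch below assumes that every request is granted (i.e., that the incentive has been paid); the data structures and the function name are hypothetical.

```python
def grant(cap, processor, resources):
    """Add the granted resources to the processor's capability list."""
    cap.setdefault(processor, set()).update(resources)

# Sequence of access requests of Figure 3.1: (processor, requested resources)
requests = [
    ("w", "abc"),
    ("x", "cdef"),
    ("y", "abc"),
    ("z", "bc"),
    ("z", "adef"),
    ("w", "def"),
]

cap = {}
for processor, resources in requests:
    grant(cap, processor, resources)
```

After the six requests, w and z are authorized for all six resources, x for c, d, e, and f, and y for a, b, and c.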

To allow fine-grained access control as demanded by our scenario, without the intervention 
of the data owner to mediate each access request and without the need to trust the data market to 
properly enforce access privileges, we leverage selective encryption [DFJ+10]. Since encrypted 
resources self-enforce the access restrictions defined over them (see Section 3.3), the data owner 
can physically outsource her published resources to the market platform (which can be hosted, for 
instance, on the cloud), with the guarantee that only processors provided with the encryption keys 
will be able to decrypt resources. A straightforward way to enforce access restrictions through 
selective encryption is to encrypt each resource with a different key, and to distribute to 
each processor the keys of the resources in her capability list. However, this practice would imply 
a considerable key management burden for processors. To mitigate such overhead, we adopt key 
derivation (see Section 3.3). Intuitively, each processor p agrees on a key $k_p$ with the data owner, 
who publishes a token allowing p to compute $k_r$ from $k_p$ for each r in cap(p) (if the same processor 
can access resources from different owners, she can create a set of tokens enabling her to compute 
the key shared with each owner from a single (secret) master key). While effective, this simple 
solution may be inefficient, for instance because it might create more tokens than necessary. A 
possible optimization is to keep the number of tokens under control and, to this end, the key 
derivation structure is typically enriched with additional vertices whose key is used for derivation 


only [DFJ+10, DFLS16]. While in cloud-based scenarios such additional vertices are typically 
associated with groups of users, the considered scenario would benefit from the definition of keys 
associated with groups of resources. Indeed, such an approach guarantees that each resource 
r has a different encryption key (the one corresponding to the singleton set {r}). Also, the data 
market entails a dynamic scenario, with processors joining the system and accessing (and possibly 
paying incentives for) subsets of resources, and it is therefore natural to think in terms of groups 
of resources rather than groups of users. 


Processor | Capability List 
w         | a,b,c,d,e,f 
x         | c,d,e,f 
y         | a,b,c 
z         | a,b,c,d,e,f 

(a) 


Figure 3.3: An example of authorization policy (a), and of a key derivation structure enforcing it (b) 
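
The token mechanism described above can be sketched as follows. This is a minimal illustration, assuming the token form t = k_to XOR h(k_from, label_to), which matches the derivation step used by the audit pseudocode in Figure 3.7; the use of SHA-256 as h, the 32-byte key length, and all function and variable names are assumptions of this sketch, not choices stated in the text.

```python
import hashlib
import secrets

KEY_LEN = 32  # bytes; illustrative choice, not prescribed by the text


def h(key: bytes, label: bytes) -> bytes:
    """Hash of a secret key and a public label (SHA-256 is an assumption)."""
    return hashlib.sha256(key + label).digest()


def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_token(k_from: bytes, k_to: bytes, label_to: bytes) -> bytes:
    """Public token letting whoever knows k_from compute k_to."""
    return xor(k_to, h(k_from, label_to))


def derive(k_from: bytes, token: bytes, label_to: bytes) -> bytes:
    """Recompute k_to from k_from, the public token, and the public label."""
    return xor(token, h(k_from, label_to))


# Owner side: one key for the processor and one key per resource.
k_p = secrets.token_bytes(KEY_LEN)
resource_keys = {r: secrets.token_bytes(KEY_LEN) for r in "abc"}
labels = {r: f"label-{r}".encode() for r in resource_keys}
tokens = {r: make_token(k_p, k, labels[r]) for r, k in resource_keys.items()}

# Processor side: k_p is the only secret she needs to store.
derived = {r: derive(k_p, tokens[r], labels[r]) for r in tokens}
```

With one token per (processor, resource) pair, as here, the number of tokens grows with the size of every capability list, which is what the additional vertices for groups of resources are meant to reduce.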

Formally, a key derivation structure is defined as follows. 


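Definition 3.4.1 (Key derivation structure) Given a set $\mathcal{R}=\{r_1,\ldots,r_n\}$ of resources and a set 
$\mathcal{P}=\{p_1,\ldots,p_m\}$ of processors, a key derivation structure over $\mathcal{R}$ and $\mathcal{P}$ is a directed acyclic 
graph G(V,E) such that: 


1. $\forall v_x \in V$, $(x \in \mathcal{P}) \lor (x \subseteq \mathcal{R})$; 
2. $\forall p \in \mathcal{P}$, $v_p \in V$ and $\forall r \in \mathcal{R}$, $v_r \in V$; 
3. $\forall (v_x, v_y) \in E$: $(y \subset x) \lor (x \in \mathcal{P} \land y \subseteq \mathcal{R})$. 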


According to the definition above, vertices in the key derivation structure represent processors 
or sets of resources (Condition 1). Also, the derivation structure has a vertex for each processor and 
for each resource (Condition 2). Vertices representing processors have only outgoing edges ending 
at vertices representing (sets of) resources, while edges connecting sets of resources satisfy the 
subset containment relationship, that is, each vertex is connected to vertices representing subsets 
of its resources (Condition 3). Each processor p knows the key of her vertex $v_p$ and each resource 
r is encrypted with the key of its vertex $v_r$. With reference to our running example defined over six 
resources and four processors, Figure 3.3(b) illustrates an example of key derivation structure with 
a vertex for each processor (gray), a vertex for each resource, and additional vertices for subsets 
of resources. As already noted, for simplicity in the figures we denote each vertex $v_x$ with x (e.g., 
a is the vertex for resource a and x is the vertex of processor x). 

A key derivation structure correctly enforces an authorization policy $\mathcal{A}$ iff it allows each 
processor to derive all and only the keys used to encrypt the resources that she is authorized to access, 
meaning that each processor must be able to reach, starting from her vertex, all and only the vertices 
representing the resources in her capability list. 

Definition 3.4.2 (Correctness) Given an authorization policy $\mathcal{A}$ over a set $\mathcal{R}$ of resources and a 
set $\mathcal{P}$ of processors, a key derivation structure G(V,E) over $\mathcal{R}$ and $\mathcal{P}$ correctly enforces $\mathcal{A}$ iff: 


$\forall r \in \mathcal{R}, p \in \mathcal{P}$: $r \in cap(p) \iff \exists$ a path in G from $v_p$ to $v_r$. 


To correctly enforce the authorization policy $\mathcal{A}$, we include in the key derivation structure a 
vertex for each capability list of the processors in $\mathcal{P}$ and an edge to connect the vertex $v_p$ of 
each processor to the vertex $v_{cap(p)}$ representing her capability list cap(p). If needed to further 
reduce the number of tokens, we insert additional vertices representing subsets of resources even 
if not corresponding to any capability list. We then connect vertices representing sets of resources 
in such a way as to ensure that the set R of resources represented by a vertex $v_R$ is covered by the 
vertices directly reachable from $v_R$ through an edge in G (meaning that the union of the sets of 
resources represented by the vertices directly reachable from $v_R$ is exactly R). 


To illustrate, consider the authorization policy in Figure 3.3(a) for the sets of resources and 
processors of our running example. Figure 3.3(b) illustrates an example of a key derivation 
structure enforcing the policy. The structure includes one vertex for each resource, one vertex for each 
processor, and three vertices representing their capability lists. Edges of the structure connect 
processors to their capability lists, and sets of resources according to the subset containment 
relationship, in such a way as to guarantee coverage (e.g., vertex abc is covered by a, b, and c). It is 
immediate to verify that the key derivation structure in Figure 3.3(b) correctly enforces the 
authorization policy in Figure 3.3(a), since it enables each processor to reach all and only the resources 
she is entitled to access. 
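
The correctness condition of Definition 3.4.2 amounts to a reachability check, which can be sketched as follows. The edge set below is an assumption: it is one structure consistent with the textual description of the running example (three capability-list vertices, coverage by subsets), and the actual edges drawn in Figure 3.3(b) may differ. Vertex names (single letters for resources, concatenated letters for sets, w, x, y, z for processors) are likewise a convention of this sketch.

```python
from typing import Dict, Set


def reachable(edges: Dict[str, Set[str]], start: str) -> Set[str]:
    """Vertices reachable from start by following derivation edges."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        for nxt in edges.get(stack.pop(), set()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def correctly_enforces(edges: Dict[str, Set[str]],
                       cap: Dict[str, Set[str]],
                       resources: Set[str]) -> bool:
    """Definition 3.4.2: r is in cap(p) iff a path exists from v_p to v_r."""
    return all(reachable(edges, p) & resources == allowed
               for p, allowed in cap.items())


resources = set("abcdef")
cap = {"w": set("abcdef"), "x": set("cdef"),
       "y": set("abc"), "z": set("abcdef")}

# Assumed edge set, consistent with the description of the example.
edges = {
    "w": {"abcdef"}, "z": {"abcdef"}, "x": {"cdef"}, "y": {"abc"},
    "abcdef": {"abc", "cdef"},
    "abc": {"a", "b", "c"},
    "cdef": {"c", "d", "e", "f"},
}

print(correctly_enforces(edges, cap, resources))
```

Dropping the edge from abc to c, or adding an edge from abc to d, makes the check fail for processor y: in the first case she can no longer reach c, in the second she reaches a resource outside her capability list.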


3.4.2 Resources and access management 


The key derivation structure is updated whenever new resources are published and/or new access 
privileges are granted. The publication of a resource r is enforced by simply inserting a vertex 
for r in the structure. The owner generates an encryption key $k_r$ and a label $l_r$ for the vertex, 
encrypts r with $k_r$, and publishes the encrypted resource on the market. To illustrate, consider the 
structures in Figure 3.4. The first structure (Figure 3.4(a)) represents the structure for our running 
example after the publication of the six resources (since no authorization has been granted yet, no 
processor vertex is present). 

Granting access for a set R of resources to processor p is enforced by procedure Grant_Access 
(Figure 3.5). Note that, in the figure and in the following discussion, the generation of vertices 
(edges, resp.) implies the generation of their keys and labels (tokens, resp.). The procedure takes 
as input the requesting processor p, her current capability list cap(p), the set R of resources to 
which she is to be granted access, and the key derivation structure G(V,E). It updates the structure, 
enabling p to derive the keys necessary to decrypt the resources in $cap(p) \cup R$. The procedure first 
checks whether the structure already contains a vertex $v_p$ for p (i.e., whether p can already access 
resources in the market) and, if this is not the case, it creates vertex $v_p$ and the corresponding key 
and label (lines 1-2). If the vertex $v_{cap(p)}$ representing the (old) capability list of p exists, the 
procedure deletes edge $(v_p, v_{cap(p)})$ (lines 3-4). It then checks whether the removal of $v_{cap(p)}$ could 
reduce the number of edges [DFJ+10] and, if this is the case, it removes $v_{cap(p)}$, connecting all its 
parents to all its children (lines 5-13). The procedure then updates the capability list cap(p) by 
including the resources in R (line 14). If the key derivation structure already includes vertex 
$v_{cap(p)}$ representing the new capability list of p, then $v_p$ is simply connected to $v_{cap(p)}$ and the 
procedure terminates (line 15). Otherwise, $v_{cap(p)}$ first needs to be created, and only at this point 
an edge is created to enable the derivation of $v_{cap(p)}$ from $v_p$ (lines 16-19). To guarantee the 
correctness of the key derivation structure, all the resources in cap(p) should be reachable from 
$v_{cap(p)}$ (Definition 3.4.2). Hence, the procedure identifies the set Desc of vertices representing 
subsets of resources in cap(p), and selects a subset Cover of vertices in Desc forming a set cover for 
cap(p) (lines 20-21). Vertex $v_{cap(p)}$ is connected to the vertices in Cover (line 22). The procedure 
finally checks whether it is possible to further reduce the number of edges thanks to the presence of 
$v_{cap(p)}$. To this aim, the procedure identifies the set Par of the vertices representing supersets of 
cap(p), and the set DescCover of the vertices reachable from a vertex in Cover (lines 23-24). Indeed, 
if a vertex $v_{par}$ in Par is directly connected to more than one vertex in Cover and/or in DescCover, 
the insertion of $v_{cap(p)}$ as an intermediate vertex and the removal of the edges from $v_{par}$ to the 
vertices in Cover and/or in DescCover (lines 25-31) reduces the number of edges. 


Figure 3.4: Evolution of a key derivation structure: (a) resource publishing; (b) w is granted access 
to abc, x is granted access to cdef; (c) y is granted access to abc; (d) z is granted access to bc; 
(e) z is granted access to adef; (f) w is granted access to def 


Figure 3.4 illustrates the evolution of the first structure (Figure 3.4(a)) to enforce the sequence 
of requests illustrated at the bottom of the structures. The first two requests insert vertices abc 
and cdef for w and x, respectively (Figure 3.4(b)). The third request does not insert vertices for 
resources since cap(y)=abc already belongs to the structure (Figure 3.4(c)). The fourth request 
inserts bc for z. Note that connecting abc to bc saves an edge (Figure 3.4(d)). The fifth request 
inserts a vertex for the entire set $\mathcal{R}$, for which z is authorized. Vertex bc then becomes redundant 
and is removed (Figure 3.4(e)). The last request authorizes w for all the resources. Since cap(w) 
belongs to the structure, no vertex is inserted (Figure 3.4(f)). 


GRANT_ACCESS(p, cap(p), R, G(V,E)) 
 1: if v_p ∉ V then 
 2:   generate v_p; V := V ∪ {v_p} 
 3: if v_cap(p) ∈ V then 
 4:   E := E \ {(v_p, v_cap(p))} 
 5:   if ∄p' ≠ p s.t. cap(p) = cap(p') then 
 6:     let par be the number of incoming edges of v_cap(p) 
 7:     let desc be the number of outgoing edges of v_cap(p) 
 8:     if (par × desc) < (par + desc) then 
 9:       for each v_par ∈ V : (v_par, v_old) ∈ E do 
10:         E := E \ {(v_par, v_old)} 
11:         for each v_desc ∈ V : (v_old, v_desc) ∈ E do 
12:           E := E \ {(v_old, v_desc)} ∪ {(v_par, v_desc)} 
13:       V := V \ {v_cap(p)} 
14: cap(p) := cap(p) ∪ R 
15: if v_cap(p) ∈ V then E := E ∪ {(v_p, v_cap(p))} 
16: else 
17:   generate v_cap(p) 
18:   V := V ∪ {v_cap(p)} 
19:   E := E ∪ {(v_p, v_cap(p))} 
20:   let Desc ⊆ V be the set of vertices over a set of resources ⊂ cap(p) 
21:   let Cover be a subset of Desc whose resources form a set cover for cap(p) 
22:   for each v_cover ∈ Cover do E := E ∪ {(v_cap(p), v_cover)} 
23:   let Par ⊆ V be the set of vertices over a set of resources ⊃ cap(p) 
24:   let DescCover be the set of vertices reachable from vertices in Cover 
25:   for each v_par ∈ Par do 
26:     ToRemove := ∅ 
27:     for each v ∈ DescCover ∪ Cover do 
28:       if (v_par, v) ∈ E then ToRemove := ToRemove ∪ {(v_par, v)} 
29:     if |ToRemove| ≥ 2 then 
30:       E := E ∪ {(v_par, v_cap(p))} 
31:       for each (v_par, v_x) ∈ ToRemove do E := E \ {(v_par, v_x)} 


Figure 3.5: Procedure managing access grants 
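

As an illustration of how the procedure in Figure 3.5 operates, the following sketch replays the sequence of requests of Figure 3.4. It is a simplified rendition under stated assumptions, not the reference implementation: keys, labels, and tokens are omitted, the set cover is chosen greedily (the procedure does not prescribe how Cover is selected), vertices of single resources are never removed (a guard added here), and the names Structure, reach, and grant_access are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple, Union

Vertex = Union[str, FrozenSet[str]]  # processor name, or a set of resources


class Structure:
    """Key derivation structure; keys, labels, and tokens are omitted."""

    def __init__(self, resources: Iterable[str]) -> None:
        self.V: Set[Vertex] = {frozenset([r]) for r in resources}
        self.E: Set[Tuple[Vertex, Vertex]] = set()
        self.cap: Dict[str, FrozenSet[str]] = {}


def reach(s: Structure, starts: Iterable[Vertex]) -> Set[Vertex]:
    """Vertices reachable from any vertex in starts."""
    seen: Set[Vertex] = set()
    stack = list(starts)
    while stack:
        u = stack.pop()
        for a, b in s.E:
            if a == u and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def grant_access(s: Structure, p: str, R: Iterable[str]) -> None:
    """Grant p access to R; R must contain only published resources."""
    s.V.add(p)                                              # lines 1-2
    old = s.cap.get(p, frozenset())
    if old in s.V:                                          # lines 3-4
        s.E.discard((p, old))
        shared = any(q != p and c == old for q, c in s.cap.items())
        if not shared and len(old) > 1:                     # line 5, plus guard
            parents = [u for u, v in s.E if v == old]
            children = [v for u, v in s.E if u == old]
            if len(parents) * len(children) < len(parents) + len(children):
                s.E -= {(u, old) for u in parents}          # lines 9-12
                s.E -= {(old, c) for c in children}
                s.E |= {(u, c) for u in parents for c in children}
                s.V.discard(old)                            # line 13
    new = old | frozenset(R)                                # line 14
    s.cap[p] = new
    if new in s.V:                                          # line 15
        s.E.add((p, new))
        return
    s.V.add(new)                                            # lines 17-19
    s.E.add((p, new))
    desc = [v for v in s.V if isinstance(v, frozenset) and v < new]
    cover, uncovered = [], set(new)
    while uncovered:                                        # greedy set cover
        best = max(desc, key=lambda v: (len(v & uncovered), sorted(v)))
        if not best & uncovered:
            raise ValueError("R contains an unpublished resource")
        cover.append(best)
        uncovered -= best
    s.E |= {(new, c) for c in cover}                        # line 22
    targets = set(cover) | reach(s, cover)                  # line 24
    for u in [v for v in s.V if isinstance(v, frozenset) and v > new]:
        to_remove = {(u, v) for v in targets if (u, v) in s.E}
        if len(to_remove) >= 2:                             # lines 29-31
            s.E.add((u, new))
            s.E -= to_remove


# Replay of the requests in Figure 3.4(b)-(f).
s = Structure("abcdef")
for proc, res in [("w", "abc"), ("x", "cdef"), ("y", "abc"),
                  ("z", "bc"), ("z", "adef"), ("w", "def")]:
    grant_access(s, proc, res)
```

After the replay, vertex bc is no longer in the structure and both w and z point to the vertex for the whole set of resources, in line with the evolution described for Figure 3.4.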


3.5 Counteracting misbehaviors 


When granting an authorization implies an incentive to be paid to the data owner, data owners and 
processors should trust each other, as in any situation where a vendor sells a product or a service 
(access to resources, in our case) to a buyer. If access to a resource is granted before payment, 
the owner needs to trust the processor to finalize the payment. If access is granted after payment, 
the processor needs to trust the owner to grant access to the resource(s) for which she paid the 
incentive. Requiring such a level of trust is unrealistic in our scenario, and processors and owners 
might misbehave to gain advantages. We therefore need a solution enabling them to conclude a contract 
without fully trusting each other. In this section, we first illustrate possible misbehaviors that a 
malicious party could adopt to gain advantages over the counterpart. We then present our solution 
to mitigate these risks. 


3.5.1 Malicious behaviors 


In principle, both the data owner and the processors might get advantages in behaving maliciously. 
The data owner might not grant access to her resources after having received a payment, with eco- 
nomic/privacy advantages. The processor might instead not pay after having received access to a 
resource, with economic/knowledge advantages. Also, a malicious party can blame a misbehavior 
on the other (honest) party (e.g., a malicious owner could claim a payment has not been executed 
and require a new payment). The main misbehaviors can be classified as follows: 


• NO_ACCESS: a malicious data owner does not grant access to a processor p for (at least) a 
resource for which p paid the agreed amount; 


• NO_PAYMENT: a malicious processor does not pay to the data owner o the agreed amount 
for (at least) a resource for which o provided access; 


• NO_ACCESS*: a malicious processor claims that, for (at least) a resource for which she paid 
the agreed amount, access has not been provided (while it has); 


• NO_PAYMENT*: a malicious owner claims that, for (at least) a resource for which she 
granted access to a processor, payment has not been finalized (while it has). 


Misbehaviors related to payments (i.e., NO_PAYMENT and NO_PAYMENT*) can be easily pre- 
vented by adopting blockchain and smart contracts, granting access to a resource upon the recep- 
tion of a money transfer from a processor. A straightforward solution could consist of directly 
trading the encryption keys with a smart contract which, upon receiving a payment from a proces- 
sor p for a set R of resources, triggers the algorithm illustrated in the previous section to automat- 
ically generate keys, labels, and tokens enabling p to access all resources in R. Unfortunately, this 
solution is not viable due to the public nature of the content of smart contracts. Updating the token 
catalog requires in fact knowledge of the keys used in the system (including those assigned to 
processors and those used to encrypt resources), and hence any subject observing the blockchain 
would be able to decrypt the resources. We now illustrate our solution to this problem. 


3.5.2 Counteracting approach 


We propose an interaction protocol, to regulate the interplay between processors and data own- 
ers, and an audit process, to identify misbehaving parties. The interaction protocol prevents 
NO_PAYMENT and NO_PAYMENT* misbehaviors, ensuring that if an incentive must be paid to the 
owner for accessing some of her resources, then no access can be granted without such payment. 
The audit process detects NO_ACCESS and NO_ACCESS* misbehaviors, exposing malicious owners 
(processors, resp.) that do not grant access despite having received the incentive (maliciously 
claim that an access for which they paid the incentive has not been granted, resp.). The combined 
adoption of these two approaches incentivizes all parties to behave correctly. 


Interaction protocol. The interaction protocol relies on smart contracts to regulate how a 
processor p and a data owner o should operate to safely finalize the payment of the incentive following 
an access grant to a set R of resources. Its adoption guarantees that the data owner receives the 
incentive and the processor receives a public commitment by the data owner to grant access to 
R. Since encryption keys cannot be directly managed through smart contracts, we leverage smart 
contracts and blockchains only to enforce the payment of the incentive, and to securely log the 
willingness of o to grant p access to the requested resources after the payment is received. An 
executed smart contract then represents an incontrovertible proof that: i) the payment has been 
performed; and ii) the data owner is aware of her obligation to give p access to the resources. We 
then complement smart contracts with a solution (i.e., the audit process) that enables a designated 
trusted subject (i.e., an auditor) to check, upon request (e.g., when one of the two parties detects 
or suspects a misbehavior), whether the accesses dictated by the contract are actually provided 
(i.e., whether the owner did what she committed to). To enable such control, within the interaction 
protocol we: i) store on-chain the catalog $\mathcal{T}$ and the capability lists of the processors (so that they 
can be queried in the audit); and ii) require the encryption key $k_p$ provided by o to p to be signed 
by p (so that it can be proved to be authentic in the audit). We then assume each processor p to have 
a private ($priv_p$), public ($pub_p$) key pair. 


Figure 3.6: Interaction protocol 

The interaction between a processor p, willing to purchase a set R of resources, and the owner 
o of the resources operates according to the protocol in Figure 3.6. Note that we consider the case 
in which o is willing to grant p access for R: if this is not the case and the owner does not wish to 
grant access, she simply does not participate in the protocol, and the interaction terminates. The 
protocol operates as follows: 


1. p contacts o (off-chain), communicating the set R of resources for which she requests access 
and for which she is willing to pay the incentive to o; 


2. if p and o have not yet interacted, o generates a key $k_p$ for p and sends it to p; 


3. upon receiving $k_p$, p signs it with her secret key $priv_p$, and sends the signed key (denoted 
$[k_p]_{priv_p}$) to o; 


4. o prepares a smart contract, dictating that “upon receiving incentive from p for R, 
$cap(p) := cap(p) \cup R$, and the token catalog $\mathcal{T}$ is updated in such a way to allow p to 
derive the keys for the resources in cap(p)”; 


5. o deploys the contract on the blockchain; 


6. p accesses the contract and signs it, automatically triggering the payment of the incentive; 


7. o updates the key derivation structure as required by the executed contract (see Section 3.4.2); 


8. o updates $\mathcal{T}$ on the blockchain. 


Since $\mathcal{T}$ is stored on-chain, at the end of the interaction protocol p can query it to obtain 
the information necessary to derive the keys for the resources in R. Also, since the key derivation 
structure is updated by the owner on her premises, keys are kept safe. Note that, to enable the 
audit process (as clarified in the remainder of this section), tokens operate on signed keys $[k_p]_{priv_p}$ 
(in contrast to $k_p$). 

If an interested processor and a data owner interact through this protocol, both NO_PAYMENT 
and NO_PAYMENT* misbehaviors are prevented. In fact, the key derivation structure is updated 
locally by the owner granting the processor access to resources only after the payment has been 
received. Since the blockchain is public, every user can verify whether the payment has been 
performed. 
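
The on-chain part of the protocol (steps 4 to 8) can be pictured with the following toy simulation. It is only a sketch under stated assumptions: the Ledger class is an in-memory stand-in for a blockchain, the Contract class and its price field are hypothetical, and no real smart contract platform or payment mechanism is modelled. Its purpose is to show the ordering constraint that the protocol relies on: the capability list is extended when the payment is logged, and the owner publishes tokens only afterwards.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List


@dataclass
class Ledger:
    """Append-only stand-in for the blockchain (an assumption of this sketch)."""
    events: List[tuple] = field(default_factory=list)
    cap: Dict[str, FrozenSet[str]] = field(default_factory=dict)
    catalog: Dict[tuple, bytes] = field(default_factory=dict)  # token catalog


@dataclass
class Contract:
    owner: str
    processor: str
    resources: FrozenSet[str]
    price: int  # hypothetical amount of the incentive
    executed: bool = False

    def sign_and_pay(self, ledger: Ledger, signer: str, amount: int) -> None:
        """Step 6: signing triggers the payment and the capability update."""
        if signer != self.processor or amount < self.price or self.executed:
            raise ValueError("contract cannot be executed")
        self.executed = True
        ledger.events.append(("payment", self.processor, self.owner, amount))
        old = ledger.cap.get(self.processor, frozenset())
        ledger.cap[self.processor] = old | self.resources


def publish_tokens(ledger: Ledger, contract: Contract,
                   tokens: Dict[tuple, bytes]) -> None:
    """Steps 7-8: an honest owner updates the catalog only after payment."""
    if not contract.executed:
        raise PermissionError("no payment logged on-chain")
    ledger.catalog.update(tokens)
    ledger.events.append(("catalog-update", contract.owner, sorted(tokens)))


ledger = Ledger()
contract = Contract(owner="o", processor="p",
                    resources=frozenset("abc"), price=10)
ledger.events.append(("deploy", "o"))                      # step 5
contract.sign_and_pay(ledger, "p", 10)                     # step 6
publish_tokens(ledger, contract, {("p", "abc"): b"\x00"})  # steps 7-8
```

Since every event is appended to the ledger, any observer can check that a payment entry precedes the corresponding catalog update, which is the property invoked above against NO_PAYMENT and NO_PAYMENT* claims.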


Audit process. Since accesses are directly granted by the data owner and keys and resources are 
not exchanged on-chain, NO_ACCESS and NO_ACCESS* misbehaviors cannot be prevented. We 
then propose an audit process for detecting and exposing them (hence negatively impacting the 
reputation of the misbehaving party). To this end, our audit process allows a designated trusted 
auditor, chosen by agreement between the owner and the processor (and possibly identified in the 
smart contract, so as to have proof that both parties agree on it), to check whether the processor 
does have access to all the resources for which she paid the incentive. The audit process can be 
invoked either by a processor claiming and wishing to expose a NO_ACCESS misbehavior, or by 
an owner claiming and wishing to expose a NO_ACCESS* misbehavior. 

Given the identity of the processor p and of the owner o involved in the audit process, the 
auditor checks whether the current token catalog $\mathcal{T}$ enables p to derive the keys for the resources 
in cap(p), to discriminate between NO_ACCESS and NO_ACCESS*. Figure 3.7 illustrates the audit 
process. The auditor first queries the public ledger maintained on-chain to obtain the capability 
list cap(p) of the processor, the token for deriving key $k_{cap(p)}$, and label $l_{cap(p)}$. If the token 
(or the label) does not exist, the auditor signals a NO_ACCESS misbehavior. In fact, the derivation 
structure cannot allow p to derive the keys for the resources in cap(p) (i.e., the owner ignored the 
incentive). Otherwise, the auditor retrieves $[k_p]_{priv_p}$ from the owner and, using the catalog, derives 
all the encryption keys reachable from $[k_p]_{priv_p}$. To verify that $[k_p]_{priv_p}$ is the key that p and o 
exchanged in the interaction protocol (Figure 3.6), the auditor checks its signature. If signature 
verification fails, either o has defined/updated the derivation structure starting from the wrong key 
(and hence p cannot access the resources in her capability list), or she is not participating honestly 
in the audit process (i.e., she returned a different key from the one agreed with p). In either case, 
the auditor signals a NO_ACCESS misbehavior, exposing a misbehavior of the owner. On the 
contrary, if signature verification succeeds, the auditor derives $k_{cap(p)}$ and the keys of the resources in 
cap(p). The auditor can then try to decrypt all resources in cap(p) using these keys. If decryption 
fails (because for at least one resource the related key is not derivable or incorrect), the auditor 
again signals a NO_ACCESS misbehavior. Otherwise, if decryption succeeds, the auditor returns 
a NO_ACCESS* misbehavior, since the owner respected her obligation and hence the processor is 
dishonestly accusing the owner of misbehavior. Note that the audit process could expose the 
plaintext content of resources to the auditor when the processor maliciously accuses the data owner of 
misbehavior. However, a malicious processor can disclose the purchased resources to any subject 
(hence including the auditor) independently of the audit process, so the process itself does not 
introduce additional disclosure risks. Also, thanks to our audit process, the misbehavior of the 
processor is revealed. Therefore, we expect processors not to dishonestly blame a misbehavior on 
an honest owner, as this would decrease their reputation. 


AUDIT(p, o) 
 1: retrieve cap(p) 
 2: retrieve t_p,cap(p) and l_cap(p) from T 
 3: if t_p,cap(p) = NULL or l_cap(p) = NULL then 
 4:   return ‘NO_ACCESS misbehavior’ 
 5: retrieve [k_p]priv_p from o 
 6: if signature verification of [k_p]priv_p fails then 
 7:   return ‘NO_ACCESS misbehavior / o not collaborating’ 
 8: compute k_cap(p) = h([k_p]priv_p, l_cap(p)) ⊕ t_p,cap(p) 
 9: derive all the resource keys k_r derivable from k_cap(p) 
10: for each r ∈ cap(p) do 
11:   r := decrypt(E(r), k_r) 
12:   if decryption fails then 
13:     return ‘NO_ACCESS misbehavior’ 
14: return ‘NO_ACCESS* misbehavior’ 


Figure 3.7: Pseudocode of the audit process 
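

The audit of Figure 3.7 can be exercised end to end with the following sketch. Several elements are assumptions made only to obtain a self-contained example: SHA-256 plays the role of h, a toy hash-based authenticated cipher (limited to 32-byte messages and not meant for real use) replaces the unspecified encryption scheme so that a wrong key is detectable, signature verification is passed in as a callable, and the vertex of a capability list is named by concatenating its sorted resources.

```python
import hashlib
import hmac
import secrets
from typing import Callable, Dict, Optional, Set, Tuple


def h(key: bytes, label: bytes) -> bytes:
    return hashlib.sha256(key + label).digest()


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt(plain: bytes, key: bytes) -> bytes:
    """Toy authenticated encryption, messages up to 32 bytes (illustration)."""
    body = xor(plain, h(key, b"stream"))
    return hmac.new(key, body, hashlib.sha256).digest() + body


def decrypt(blob: bytes, key: bytes) -> Optional[bytes]:
    tag, body = blob[:32], blob[32:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None  # wrong key or altered ciphertext
    return xor(body, h(key, b"stream"))


def audit(p: str, cap_p: Set[str], catalog: Dict[Tuple[str, str], bytes],
          labels: Dict[str, bytes], signed_kp: bytes,
          verify: Callable[[bytes], bool], store: Dict[str, bytes]) -> str:
    target = "".join(sorted(cap_p))            # name of vertex v_cap(p)
    token, label = catalog.get((p, target)), labels.get(target)
    if token is None or label is None:         # lines 3-4
        return "NO_ACCESS"
    if not verify(signed_kp):                  # lines 6-7
        return "NO_ACCESS / o not collaborating"
    derived = {target: xor(h(signed_kp, label), token)}   # line 8
    frontier = [target]
    while frontier:                            # line 9
        u = frontier.pop()
        for (a, b), t in catalog.items():
            if a == u and b not in derived and b in labels:
                derived[b] = xor(h(derived[u], labels[b]), t)
                frontier.append(b)
    for r in cap_p:                            # lines 10-13
        if r not in derived or decrypt(store[r], derived[r]) is None:
            return "NO_ACCESS"
    return "NO_ACCESS*"                        # line 14


# Honest owner: p may access a and b through vertex ab.
signed_kp = secrets.token_bytes(32)            # stands in for the signed key
keys = {v: secrets.token_bytes(32) for v in ("ab", "a", "b")}
labels = {v: v.encode() for v in keys}


def token(k_from: bytes, v: str) -> bytes:
    return xor(keys[v], h(k_from, labels[v]))


catalog = {("p", "ab"): token(signed_kp, "ab"),
           ("ab", "a"): token(keys["ab"], "a"),
           ("ab", "b"): token(keys["ab"], "b")}
store = {r: encrypt(b"content of " + r.encode(), keys[r]) for r in "ab"}
verdict = audit("p", {"a", "b"}, catalog, labels, signed_kp,
                lambda s: s == signed_kp, store)
```

With a complete catalog the verdict is NO_ACCESS*, that is, the owner complied and a complaint by the processor would be unfounded; removing any token on the path to a resource turns the verdict into NO_ACCESS.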

The reliability of the results of the audit process clearly depends on the correctness and 
freshness of the data over which controls operate (i.e., tokens, labels, capability lists, processors’ signed 
keys). The correctness and freshness of tokens, labels, and capability lists are guaranteed by the fact 
that they are stored on-chain, as dictated by the smart contract, and hence in a safe and immutable 
ledger that the auditor can query. The correctness and freshness of $[k_p]_{priv_p}$ are guaranteed by the 
digital signatures, since o cannot reproduce (nor p repudiate) a signature with $priv_p$. 

The availability of the audit process clearly incentivizes both the data owner and the proces- 
sors to behave correctly. Indeed, the audit process reveals misbehaviors and publicly exposes the 
identity of the malicious subject. This can have serious consequences on her reputation, with clear 
damages in the data market where (in a similar way to, for instance, e-commerce platforms) lower 
reputation can be expected to cause lower willingness of other parties to engage in interactions. 
We then expect the availability of our audit process to prevent misbehaviors. 


3.6 Discussion 
The combined adoption of selective encryption and of an approach based on blockchain, smart 
contracts, and an audit process for regulating the interactions between data owners and processors 
is a first step towards the enforcement of transparent processing. Transparent processing of 
data requires, among other aspects, logging all events related to data processing and sharing, and 
enabling control that the processing itself is performed according to the policy set by the data 
owner [BKPW17]. We fulfill these requirements by ensuring that: i) each resource is shared only 
with processors authorized by the owner; and ii) sharing is securely logged on-chain, producing 
a verifiable trail of sharing history. This ensures that the data owner knows and can prove, at any 
time, who is able to access her resources, and that each processor is able to prove that all accesses 
to resources were authorized. Also, since we store both the authorization policy and the token 
catalog on-chain, their updates leave a permanent trace. Hence, not only is it possible to verify 
the enforcement of the most up-to-date policy, but past versions of the token catalog can also be 
checked to verify the correct enforcement of a former one. Finally, since resources are not stored 
on-chain, they can always be deleted (also in accordance with the EU GDPR). 

We close this section with a note on the generality of our solution. Our proposal does not rely 
on specific technologies or architectures, and hence can be easily tailored and deployed in different 
application scenarios. For instance, resource protection through selective encryption (Section 3.4) 
can operate with arbitrary (sufficiently strong) encryption schemes. Similarly, we do not restrict 
our smart contracts to operate on a specific blockchain (e.g., Ethereum) or programming language 
for coding the smart contract. Of course, the security of the overall system depends on the correct- 
ness of the developed code, like in any interaction governed by a smart contract. 


3.7 Summary 


This chapter focused on the problem of enforcing controlled sharing through data wrapping in 
the data market scenario. The approach illustrated in this chapter leverages selective owner-side 
encryption to protect the confidentiality of the data outsourced to the market, and key derivation 
for allowing authorized parties to compute the keys necessary to unwrap and access plaintext 
resources. We also considered the management of incentives to data owners, and proposed a 
solution based on blockchain/smart contracts and an audit process for counteracting possible 
misbehaviors. MOSAICrOWN is now studying the definition of novel strategies for managing keys 
and key derivation. 


4. Conclusions 


This document presented the results of research work performed in MOSAICrOWN Work Pack- 
age 4 devoted to the design and development of advanced encryption-based solutions for wrapping 
data ingested, stored, processed, or shared in the digital data market. The goal is to provide ef- 
ficient solutions enjoying strong protection guarantees while preserving access and processing 
functionality. The techniques presented in this deliverable address the challenges entailed by such 
a goal tackling different problems related to ensuring data protection and functionality in the dif- 
ferent phases of the information life cycle. They enable data owners, curators, and processors to 
properly enjoy functionality over the data while ensuring proper data protection against non autho- 
rized users or misbehaving parties. The modular techniques presented in this deliverable represent 
building blocks towards the realization of a digital data market empowering data owners, in com- 
pliance with data protection regulations and policies. 

The techniques presented in Chapter 1 enable owners to enjoy availability of the data 
market for processing data in collaborative computations, possibly leveraging economical solutions 
and involving external parties in the computation, while ensuring data are not leaked (directly or 
indirectly) to unauthorized parties. Such protection is realized by the controlled injection of 
encryption/decryption operations in the execution of query plans, as needed to protect the data against 
unauthorized parties to which parts of the computation are assigned. 

The techniques presented in Chapter 2 offer an advanced All-Or-Nothing-Transform (AONT) 
encryption providing strong protection guarantees even in cases where the encryption key may 
be leaked. They also enable the efficient realization of revocation for large resources without 
requiring their complete re-encryption. The chapter also presents our approach to enrich the 
management of AONT-encrypted resources leveraging the availability of decentralized data storage. 
Our solution provides for controlled slicing and allocation of resources, enabling owners to set the 
parameters regulating them as best suits their needs. 

The techniques presented in Chapter 3 enable data owners to control encryption when 
ingesting data in the data market, wrapping them in a protection layer that can be accessible only to 
authorized consumers. The consideration of blockchain and smart contracts provides owners with 
full control over their resources, permitting access to them against payment, towards the 
realization of a digital data market supporting economic incentives. 